ClubReady Digital Insights Workshop
if(!require(easypackages)){install.packages("easypackages")}
library(easypackages)
packages("plyr", "dplyr", "data.table", "xda","ggplot2", "forcats", "readr", "gridExtra", "knitr", "onehot",
"tidyr", "purrr", "survival", prompt = FALSE)Abstract
ClubReady asked Valorem to assist in the development of machine learning algorithms to predict user churn and store churn. After a thorough exploratory data analysis, the data provided sufficient information to develop an algorithm that appears to predict store churn with relatively high accuracy. However, the data does not provide sufficient information to predict user churn.
Introduction
This document details the processes and outcomes from a Valorem Digital Insight Workshop to predict churn models.
Client Description
ClubReady LLC operates a Web-based club management platform for fitness facilities, individual fitness professionals, and large corporate chains. It specializes in member management, billing, EFT, POS, and sales process/CRM. The company was incorporated in 2014 and is based in St. Louis, Missouri.
Client Business Problem
ClubReady is the third-largest company in its industry. The company recognizes the need to leverage its data assets to gain deeper knowledge and understanding of customer behaviors. ClubReady has taken steps in this area to begin analyzing the significant data it captures through its membership software applications, and seeks support to further explore its data assets. ClubReady reached out to Valorem as a preferred Microsoft partner with expertise in Advanced Analytics, including Azure Machine Learning and the Cortana Intelligence Suite of Azure services.
ClubReady strives to become a data-first company focused on membership and member retention insights gained through an exploration of ClubReady data.
Client Engagement
On October 3, 2017, ClubReady entered into a statement of work with Valorem for a Digital Insights Workshop. The workshop provided the following deliverables:
- Data Assessment and Quality Report
- Digital Analytics Vision & Roadmap of Actionable Insights
- Documented Key Priorities
- Potential Digital Insights Program ROI
- Recommended and Prioritized Analytics Actions
- Analytics Roadmap and Estimates
- Presentation of Next Steps
As documented later in the report, these deliverables were consolidated into fewer items, as agreed during the on-site workshop meeting.
Digital Insights Process
Per the SOW, the Digital Insights Workshop Project flow includes:
- Pre-Work and Agenda
- On-Site Workshop
- Analytics Run
- Present Findings
In this document, Analytics Run is renamed Exploratory Data Analysis.
Pre-Work and Agenda
On November 20, 2017, Valorem’s Project Manager led an introductory pre-workshop Skype Meeting. Andy Sweet, the ClubReady CTO and Project Sponsor, participated with the Valorem Data Scientists. The meeting consisted of:
- Team Introductions
- Project Overview
- Communications Plan
- Q & A
The core project Team Members were identified:
| Team Member | Company & Title |
|---|---|
| Andy Sweet | ClubReady CTO |
| Justin Trusty | ClubReady Data Architect |
| Lauren Crosby | ClubReady Director of Product Management |
| Matt Mercurio | ClubReady Director of Engineering |
| Brad Llewellyn | Valorem Data Scientist |
| Cliff Weaver | Valorem Data Scientist |
| Brian Roselli | Valorem Project Management |
On-Site Workshop
The on-site workshop was held at ClubReady on November 28-29, 2017. The agenda presented at the workshop included:
- Day 1
- ClubReady vision, goals, priorities
- What can machine learning do for you?
- Problem statements
- Redefine deliverables
- Churn Example
- Data Sources & Definitions
- Day 2
- Review previous day successes and misses
- Data Exploration
- Q&A
- Time Permitting (Clients Choice)
- Introduction to R
During the two-day workshop, the agenda was roughly followed. Regardless of the path, all of the workshop's objectives were accomplished:
- Understanding ClubReady goals and priorities
- Development of well-formed questions
- Review and Modification to Deliverables and Format
- Access to data
- Selection of data for analysis
Understanding ClubReady Goals & Priorities
At the start of the on-site workshop, Andy Sweet provided an overview of ClubReady, its FY18 Goals and the mission to become a data-first company believing this will provide a strategic advantage over its competitors.
Well-Formed Question Development
The success of any data science project starts with a well-formed question; without one, the project is likely to fail. At a minimum, a well-formed question provides:
- A statement of the problem or issue that needs to be solved.
- A detailed description of the data available for analysis. This includes data that is not available.
- A description of what ClubReady would like to be able to predict or categorize.
- How ClubReady will consume the output from this project (assuming the data supports an algorithmic solution).
Two questions were identified:
- Identify which clubs (full service and DIY) are likely to fail based on a percentage drop in revenue. It was determined that an 80% drop in revenue compared to the previous month(s) is the comparison metric. Note: Clubs were analyzed as a single population, not separated into full service and DIY.
- Identify each individual customer's propensity, at the club level (the lowest organization in the hierarchy), to terminate their club agreement. (This excludes one-time users, i.e., not club members.) Member Churn is defined as an individual customer agreement termination with no renewal within 30 days.
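The 80% revenue-drop metric from the first question can be sketched in R. This is illustrative only; the monthly_rev table and its columns (StoreId, Month, Revenue) are hypothetical stand-ins for the actual ClubReady data.

```r
# Flag a store as at-risk when revenue drops at least 80% versus the
# prior month. monthly_rev and its columns are hypothetical.
library(dplyr)

monthly_rev <- tibble::tribble(
  ~StoreId, ~Month,    ~Revenue,
  1,        "2017-09", 10000,
  1,        "2017-10", 1500,   # 85% drop vs. prior month -> flagged
  2,        "2017-09", 8000,
  2,        "2017-10", 7600    # 5% drop -> not flagged
)

at_risk <- monthly_rev %>%
  group_by(StoreId) %>%
  arrange(Month, .by_group = TRUE) %>%
  mutate(PctDrop = 1 - Revenue / lag(Revenue)) %>%  # drop vs. previous month
  filter(!is.na(PctDrop), PctDrop >= 0.8)           # the 80% threshold

at_risk$StoreId  # 1
```

In production this comparison would run against the SQL extracts described later in this report rather than an in-memory table.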
Review and Modification to Deliverables and Format
The Statement of Work provided the following outputs:
- Data Assessment and Quality Report
- Digital Analytics Vision & Roadmap of Actionable Insights
- Documented Key Priorities
- Potential Digital Insights Program ROI
- Recommended and Prioritized Analytics Actions
- Analytics Roadmap and Estimates
- Presentation of Next Steps
During the workshop, example documentation built with RMarkdown and saved in HTML format was presented. The ClubReady Team agreed this format was acceptable for all project reporting. (This document is the result of that agreement.)
Access to ClubReady Data
On the 2nd day of the workshop, Valorem was granted and verified access to ClubReady data.
Data Selection
ClubReady reviewed the tables and variables available in the ClubReady database (over 600 tables and 3,000+ fields). It was quickly realized that a majority of the variables were sparsely populated. The Team implemented custom SQL code to identify the sparsity of each variable. The Team then identified the initial set of candidate data ClubReady believed to be important to answer the well-formed questions. Recognize that the data available for analysis was a small subset of all the data in the database.
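Although the Team used custom SQL against the database, the same per-variable sparsity check can be sketched in R for any extracted data frame:

```r
# Share of non-missing values per column ("density"); columns below a
# chosen density threshold would be considered too sparse to select.
density_of <- function(df) sapply(df, function(col) mean(!is.na(col)))

demo <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", "z", "w"))
density_of(demo)                       # a = 0.5, b = 1.0
names(which(density_of(demo) >= 0.6))  # columns dense enough to keep: "b"
```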
Data Caveats & Assumptions
- There was no method to identify when data was populated. If a column was only 60% dense overall, the field would not have been selected even if that variable is 100% dense in the last year or two. There may be useful data that was not used in this analysis.
- There is no guarantee all impactful variables were identified. Reviewing a large dataset with a small Team does not ensure all important variables were included in the data used for modeling.
- The Team decided the last 2 years of data would be used for the project. It was in this period that ClubReady experienced significant growth.
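The two-year window can be applied with a simple date filter once the data is in R; LastActivityDate below is a hypothetical column name used for illustration:

```r
# Keep only records from the last 2 years relative to a reference date.
# LastActivityDate is a hypothetical stand-in for the actual date column.
ref_date <- as.Date("2017-11-28")   # workshop date used as the anchor
cutoff   <- seq(ref_date, by = "-2 years", length.out = 2)[2]

demo <- data.frame(
  StoreId          = 1:3,
  LastActivityDate = as.Date(c("2014-01-15", "2016-06-01", "2017-10-01"))
)
recent <- demo[demo$LastActivityDate >= cutoff, ]
recent$StoreId  # 2 3
```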
Exploratory Data Analysis
This section of the document presents the results of the Exploratory Data Analysis process. It is organized as follows:
- One of the three CSV data files, stores.csv, is explored in detail in the first section.
- In the following sections, the other two data files (Classes_Services.csv and users.csv) are explored, albeit with less explanatory content.
- All of these sections roughly follow the same thought process:
- Get Data
- Remove/transform NAs
- Character to Factors
- Explore factors with tables and plots
- Explore numerical data
- Explore variances
- Manage duplicate records
- Data Glimpse
Stores
Get Data
Working with ClubReady, Valorem developed SQL scripts capturing raw ClubReady data in CSV format. The SQL code will be provided upon request.
The stores data contains 42,322 records with 135 variables. Of the 135 variables, 15 are character type and 120 are numeric.
Categorical Variables
Examine the categorical variables first:
charSummary(stores)## n miss miss% unique
## Status 42322 0 0 3
## AverageMinsPerClass_Service 42322 0 0 83
## AveragePricePerClass_Service 42322 0 0 772
## AverageCancellationHrsPerClass_Service 42322 0 0 36
## AverageRescheduleDeadlinePerClass_Service 42322 0 0 40
## AverageMinsPerClass 42322 0 0 76
## AveragePricePerClass 42322 0 0 495
## AverageCancellationHrsPerClass 42322 0 0 33
## AverageRescheduleDeadlinePerClass 42322 0 0 35
## AverageMinsPerService 42322 0 0 57
## AveragePricePerService 42322 0 0 471
## AverageCancellationHrsPerService 42322 0 0 35
## AverageRescheduleDeadlinePerService 42322 0 0 42
## UserEmployeeRatio 42322 0 0 1446
## UserEmployeeRatio_Last3Months 42322 0 0 1326
## top5levels:count
## Status Active:37262, Inactive:4397, Cancel:663
## AverageMinsPerClass_Service NULL:17693, 45:3120, 60:2182, 50:1775, 30:1597
## AveragePricePerClass_Service NULL:17693, 0.000000:3183, 45.000000:1087, 20.000000:531, 57.500000:293
## AverageCancellationHrsPerClass_Service NULL:19039, 24:9395, 0:2152, 1:1592, 12:1549
## AverageRescheduleDeadlinePerClass_Service NULL:19039, 24:7588, 1:2221, 3:1673, 12:1500
## AverageMinsPerClass NULL:22835, 60:5776, 45:1656, 30:1263, 52:972
## AveragePricePerClass NULL:22835, 0.000000:3212, 20.000000:811, 10.000000:684, 25.000000:522
## AverageCancellationHrsPerClass NULL:25858, 24:7273, 0:2713, 1:1893, 12:863
## AverageRescheduleDeadlinePerClass NULL:25858, 24:5249, 1:2587, 3:1741, 2:1401
## AverageMinsPerService NULL:20946, 45:4126, 40:1968, 60:1851, 50:1823
## AveragePricePerService NULL:20922, 0.000000:2916, 45.000000:1674, 30.000000:457, 60.000000:445
## AverageCancellationHrsPerService NULL:21008, 24:9905, 0:2002, 16:1703, 12:1655
## AverageRescheduleDeadlinePerService NULL:21008, 24:8306, 12:1636, 16:1589, 1:1484
## UserEmployeeRatio NULL:3274, 18:382, 21:346, 28:342, 23:338
## UserEmployeeRatio_Last3Months NULL:2923, 7:412, 21:393, 25:393, 16:389
We learn the only truly categorical variable is Status, a factor with three levels: Active, Inactive, and Cancel. (The remaining character columns are numeric measures stored as text containing NULLs.)
stores <- stores %>% mutate_if(is.character, as.factor)
#charSummary(stores)
table(stores$Status)##
## Active Cancel Inactive
## 37262 663 4397
ggplot(stores, aes(fct_infreq(Status))) + geom_bar() + xlab("ClubReady Status Code") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ClubReady provided direction on how to manage Active, Inactive, and Cancel:
Inactive is a store that is no longer with ClubReady, most likely having left for a competitor. Cancel is a store that was in the process of being set up and for some reason stopped. Ignore Cancel.
Remove records where Status is not either Active or Inactive.
stores <- filter(stores, Status == 'Active' | Status == 'Inactive')
table(stores$Status)##
## Active Cancel Inactive
## 37262 0 4397
stores$Status <- factor(stores$Status)
table(stores$Status)##
## Active Inactive
## 37262 4397
After removing the records with a Status of Cancel, 41,659 records remain.
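As an aside, base R's droplevels() accomplishes the same re-leveling as the factor() call above; a minimal sketch:

```r
# droplevels() removes unused factor levels after filtering, equivalent
# to re-calling factor() on the column.
f <- factor(c("Active", "Cancel", "Inactive"))
f <- f[f != "Cancel"]
levels(f)             # "Active" "Cancel" "Inactive" -- Cancel lingers
levels(droplevels(f)) # "Active" "Inactive"
```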
Numerical Data
Review the numerical data. Pay attention to the missing-data percentage shown in the right-most column below. (Note: only the first 20 of the numerical variables are shown.)
myNumSum <- numSummary(stores)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missCNT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(n))
head(myNumSum, 20)## Variable_Name n nunique nzeros missCNT miss%
## 1 StoreId 41659 2616 0 0 0
## 2 IsChurn 41659 2 41078 0 0
## 3 IsChurn_2Months 41659 2 40026 0 0
## 4 IsChurn_3Months 41659 2 39602 0 0
## 5 IsChurn_6Months 41659 2 38505 0 0
## 6 SMSEnabled 41659 2 4 0 0
## 7 StoreAgeMonths 41659 119 290 0 0
## 8 Amenities 41659 28 30988 0 0
## 9 Integration_ABCCheckIn 41659 2 41224 0 0
## 10 Integration_ABCDataTrak 41659 2 41574 0 0
## 11 Integration_dotFit 41659 2 41259 0 0
## 12 Integration_FitBPO 41659 1 41659 0 0
## 13 Integration_fitRewards 41659 2 41252 0 0
## 14 Integration_Listen360 41659 2 30365 0 0
## 15 Integration_MotionSoft 41659 1 41659 0 0
## 16 Integration_Paramount 41659 1 41659 0 0
## 17 Integration_Perkville 41659 2 37633 0 0
## 18 Integration_PumpOne 41659 2 41634 0 0
## 19 Integration_ShareYourFit 41659 1 41659 0 0
## 20 Integration_VisualFitnessPlanner 41659 1 41659 0 0
Good, no missing data!
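This can be double-checked directly without the summary table; a minimal sketch on a toy data frame:

```r
# anyNA() is TRUE if any element of the data frame is missing;
# colSums(is.na(...)) gives per-column NA counts.
demo <- data.frame(x = 1:3, y = c("a", "b", "c"))
anyNA(demo)           # FALSE -- no missing values
colSums(is.na(demo))  # x = 0, y = 0
```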
Examine the numerical data visually to illustrate interesting distributions.
cntNumNames <- length(select(select_if(stores,is.numeric), -StoreId))
#Make plots max 6 at a time - change if needed
maxPlot = 6
loopCnt <- cntNumNames %/% maxPlot
remainder <- cntNumNames %% maxPlot
myLoop_DF <- data.frame(x = seq(1, cntNumNames-remainder, by = maxPlot), y = seq(6, cntNumNames, by = maxPlot))
myLoopMax <- max(myLoop_DF)
for(i in 1:nrow(myLoop_DF)){
myplot <- select(select_if(stores,is.numeric), -StoreId)[myLoop_DF[i,1]:myLoop_DF[i,2]]%>% gather() %>% ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + geom_histogram() #+ geom_density()
print(myplot)
}
Most of the plots above are not very interesting, but there are a few we might want to revisit individually, including Amenities, Forms, and NonRequiredForms.
p1 <- ggplot(filter(stores, Amenities > 0), aes(Amenities)) + geom_bar() + ggtitle("Amenities > 0")
p2 <- ggplot(filter(stores, Forms > 0), aes(Forms)) + geom_bar() + ggtitle("Forms > 0")
p3 <- ggplot(filter(stores, NonRequiredForms > 0), aes(NonRequiredForms)) + geom_bar() + ggtitle("NonRequiredForms > 0")
grid.arrange(p1, p2, p3, ncol=3)
Variability
Evaluate how much variability there is in each numerical variable.
# Col 1 is StoreId, col 20 is Status
myVariance <- as.data.frame(apply(stores[,-c(1,20)], 2, var))
myVariance <- tibble::rownames_to_column(myVariance)
names(myVariance)[2] <- "Variance"
myVariance <- myVariance %>% mutate(Variance2 = ifelse(Variance == 0, "No", "Yes"))
table(myVariance$Variance2)##
## No Yes
## 7 110
Because 7 variables have no variance (all the values are the same), they can be removed from the working dataset. If there are no differences in a column, it is of no use in the development of an algorithm. The variables removed because they have no variance are:
VarNames <- myVariance %>% filter(Variance > 0) %>% select(rowname)
zeroVarNames <- myVariance %>% filter(Variance == 0) %>% select(rowname)
stores <- stores %>% select(StoreId, Status, unlist(VarNames))
zeroVarNames## rowname
## 1 Integration_FitBPO
## 2 Integration_MotionSoft
## 3 Integration_Paramount
## 4 Integration_ShareYourFit
## 5 Integration_VisualFitnessPlanner
## 6 IntegrationType_LandingPages_Marketing
## 7 IntegrationType_SocialMedia
Outlier Detection
In the working dataset, there is one variable, TotalRevenue, to be evaluated for potential outliers. There are many ways to visualize outliers; boxplots are the most commonly used.
out2 <- ggplot(stores, aes(x = "", y = TotalRevenue)) + geom_boxplot(outlier.color="red", outlier.shape=8, outlier.size=4) +
scale_y_continuous(labels = scales::dollar)
There are potential outliers. ClubReady may want to review the legitimacy of these values.
Here is a list of the highest 25 TotalRevenue records:
tmpRev <- arrange(stores, desc(TotalRevenue)) %>% select(TotalRevenue)
tmpRev <- as.data.frame(head(scales::dollar(tmpRev$TotalRevenue), 25))
names(tmpRev) <- "Total_Revenue"
tmpRev## Total_Revenue
## 1 $9,155,556
## 2 $9,107,405
## 3 $8,044,618
## 4 $7,917,987
## 5 $7,695,146
## 6 $7,595,443
## 7 $7,553,951
## 8 $7,528,441
## 9 $7,150,346
## 10 $7,049,678
## 11 $6,973,129
## 12 $6,755,486
## 13 $6,742,283
## 14 $6,614,312
## 15 $6,604,380
## 16 $6,453,092
## 17 $6,298,918
## 18 $6,294,541
## 19 $6,193,235
## 20 $6,076,802
## 21 $5,976,849
## 22 $5,541,476
## 23 $5,444,348
## 24 $4,954,580
## 25 $4,687,897
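The red points in the boxplot follow ggplot2's default 1.5 * IQR rule; the same rule can be applied numerically to list candidate outliers. A minimal sketch:

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], matching ggplot2's
# default boxplot outlier definition.
iqr_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

iqr_outliers(c(10, 12, 11, 13, 12, 11, 500))  # 500
```

Applied to stores$TotalRevenue, this would return the same points the boxplot marks in red.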
Duplicates
Lastly, check for duplicate records. In this case, none are found because each record has a unique StoreId value. Without this variable, 11 duplicates would be found. (Interesting in its own right.)
# Duplicate Records
cat("The number of duplicated rows is", nrow(stores) - nrow(unique(stores)))## The number of duplicated rows is 0
# if(nrow(stores) - nrow(unique(stores)) > 0){
# head(stores[duplicated(stores),])
# stores <- stores[!duplicated(stores),]
# }
Data Overview
The working dataset has been initially scrubbed and evaluated. Take time to review and learn about the data.
The plots below illustrate how often many of the ClubReady configuration options are turned on.
First examine the variable names that start with Integration.
myColTotal <- as.data.frame(colSums(Filter(is.numeric, stores)))
myColTotal <- tibble::rownames_to_column(myColTotal)
names(myColTotal)[1] <- "Variable_Name"
names(myColTotal)[2] <- "Sum_of_Variable"
myColTotal <- filter(myColTotal, !Variable_Name %in% c("StoreId", "TotalRevenue", "Amenities"))
myColTotal$Variable_Name <- as.factor(myColTotal$Variable_Name)
myColTotal <- myColTotal %>% arrange(desc(Sum_of_Variable))
myColTotal_Int <- filter(myColTotal, Variable_Name %like% "Integration")
ggplot(myColTotal_Int, aes(x=Variable_Name, y=Sum_of_Variable)) +
geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Variable Name") + ylab("Number of Times Option Selected") + scale_x_discrete(limits= myColTotal_Int$Variable_Name)Six options appear to be much more popular than the others:
- Listen360
- Surveys
- Club Management System
- Rewards Program
- Perkville
- Data Trak
Below we find that most checkins occur between 8-11 AM and 4-7 PM local time. (This needs to be confirmed with ClubReady.)
myColTotal_checkins <- filter(myColTotal, Variable_Name %like% "Checkins_H" )
myColTotal_checkins <- myColTotal_checkins[!grepl("Months", myColTotal_checkins$Variable_Name),]
myColTotal_checkins$Variable_Name <- plyr::revalue(myColTotal_checkins$Variable_Name, c("Checkins_Hour0to1" = "0100",
"Checkins_Hour1to2" = "0200",
"Checkins_Hour2to3" = "0300", "Checkins_Hour3to4" = "0400", "Checkins_Hour4to5" = "0500",
"Checkins_Hour5to6" = "0600", "Checkins_Hour6to7" = "0700", "Checkins_Hour7to8" = "0800",
"Checkins_Hour8to9" = "0900", "Checkins_Hour9to10" = "1000", "Checkins_Hour10to11" = "1100",
"Checkins_Hour11to12" = "1200", "Checkins_Hour12to13" = "1300", "Checkins_Hour13to14" = "1400",
"Checkins_Hour14to15" = "1500", "Checkins_Hour15to16" = "1600", "Checkins_Hour16to17" = "1700",
"Checkins_Hour17to18" = "1800", "Checkins_Hour18to19" = "1900", "Checkins_Hour19to20" = "2000",
"Checkins_Hour20to21" = "2100", "Checkins_Hour21to22" = "2200", "Checkins_Hour22to23" = "2300",
"Checkins_Hour23to0" = "2400"))
myColTotal_checkins$Variable_Name <- factor(myColTotal_checkins$Variable_Name,
levels = c("0100", "0200", "0300", "0400", "0500", "0600", "0700", "0800", "0900", "1000", "1100", "1200",
"1300", "1400", "1500", "1600", "1700", "1800", "1900", "2000", "2100", "2200", "2300", "2400"))
ggplot(myColTotal_checkins, aes(x=Variable_Name, y=Sum_of_Variable)) +
geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Variable Name") + ylab("Number of Times Option Selected")Lastly, Monday, Tuesday and Wednesday have the most checkins during the week.
myColTotal_dow <- myColTotal %>% filter(!Variable_Name %like% "Checkins_H", Variable_Name %like% "Checkins_")
myColTotal_dow$Variable_Name <- substring(myColTotal_dow$Variable_Name, 10, 10000)
myColTotal_dow$Variable_Name <- as.factor(myColTotal_dow$Variable_Name)
myColTotal_dow2 <- myColTotal_dow[!grepl("_", myColTotal_dow$Variable_Name),]
myColTotal_dow2_1 <- myColTotal_dow2[grepl("day", myColTotal_dow2$Variable_Name),]
myColTotal_dow2_1 <- myColTotal_dow2_1[!grepl("Weekday", myColTotal_dow2_1$Variable_Name),]
myColTotal_dow2_1$Variable_Name <- factor(myColTotal_dow2_1$Variable_Name, levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
ggplot(myColTotal_dow2_1, aes(x=Variable_Name, y=Sum_of_Variable)) +
geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Variable Name") + ylab("Number of Times Option Selected")Data Glimpse
Here is what the resulting working dataset looks like. We are left with 41,659 records and 112 variables (we began the journey with 42,322 records and 135 variables).
glimpse(stores)## Observations: 41,659
## Variables: 112
## $ StoreId <int> 60, 105, 419, 206, 206, 3...
## $ Status <fctr> Active, Active, Inactive...
## $ IsChurn <int> 0, 1, 0, 0, 1, 0, 0, 0, 0...
## $ IsChurn_2Months <int> 0, 1, 0, 1, 1, 0, 0, 0, 0...
## $ IsChurn_3Months <int> 0, 1, 0, 1, 1, 0, 0, 0, 0...
## $ IsChurn_6Months <int> 0, 1, 1, 1, 1, 1, 0, 0, 1...
## $ SMSEnabled <int> 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ StoreAgeMonths <int> 92, 95, 76, 75, 77, 75, 8...
## $ Amenities <int> 0, 2, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_ABCCheckIn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_ABCDataTrak <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_dotFit <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_fitRewards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_Listen360 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_Perkville <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_ClubManagementSystem <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_RewardsProgram <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_Supplements_Wellness <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_Surveys <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Forms <int> 2, 0, 0, 0, 0, 0, 1, 2, 0...
## $ RequiredForms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NonRequiredForms <int> 2, 0, 0, 0, 0, 0, 1, 2, 0...
## $ NonRequiredWaiver <int> 1, 0, 0, 0, 0, 0, 1, 1, 0...
## $ NonRequiredHealthHistory <int> 1, 0, 0, 0, 0, 0, 0, 1, 0...
## $ RequiredWaiver <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins <int> 1029, 0, 0, 0, 0, 0, 614,...
## $ Checkins_Sunday <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Monday <int> 225, 0, 0, 0, 0, 0, 156, ...
## $ Checkins_Tuesday <int> 232, 0, 0, 0, 0, 0, 125, ...
## $ Checkins_Wednesday <int> 156, 0, 0, 0, 0, 0, 119, ...
## $ Checkins_Thursday <int> 216, 0, 0, 0, 0, 0, 115, ...
## $ Checkins_Friday <int> 160, 0, 0, 0, 0, 0, 81, 7...
## $ Checkins_Saturday <int> 39, 0, 0, 0, 0, 0, 18, 6,...
## $ Checkins_Weekend <int> 40, 0, 0, 0, 0, 0, 18, 6,...
## $ Checkins_Weekday <int> 989, 0, 0, 0, 0, 0, 596, ...
## $ Checkins_Hour0to1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour1to2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour2to3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour3to4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour4to5 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour5to6 <int> 56, 0, 0, 0, 0, 0, 20, 45...
## $ Checkins_Hour6to7 <int> 25, 0, 0, 0, 0, 0, 43, 14...
## $ Checkins_Hour7to8 <int> 82, 0, 0, 0, 0, 0, 52, 24...
## $ Checkins_Hour8to9 <int> 74, 0, 0, 0, 0, 0, 38, 69...
## $ Checkins_Hour9to10 <int> 237, 0, 0, 0, 0, 0, 29, 9...
## $ Checkins_Hour10to11 <int> 115, 0, 0, 0, 0, 0, 35, 6...
## $ Checkins_Hour11to12 <int> 69, 0, 0, 0, 0, 0, 41, 32...
## $ Checkins_Hour12to13 <int> 32, 0, 0, 0, 0, 0, 42, 12...
## $ Checkins_Hour13to14 <int> 26, 0, 0, 0, 0, 0, 40, 9,...
## $ Checkins_Hour14to15 <int> 19, 0, 0, 0, 0, 0, 29, 17...
## $ Checkins_Hour15to16 <int> 14, 0, 0, 0, 0, 0, 40, 31...
## $ Checkins_Hour16to17 <int> 40, 0, 0, 0, 0, 0, 42, 41...
## $ Checkins_Hour17to18 <int> 110, 0, 0, 0, 0, 0, 50, 3...
## $ Checkins_Hour18to19 <int> 46, 0, 0, 0, 0, 0, 70, 38...
## $ Checkins_Hour19to20 <int> 50, 0, 0, 0, 0, 0, 36, 22...
## $ Checkins_Hour20to21 <int> 30, 0, 0, 0, 0, 0, 6, 0, ...
## $ Checkins_Hour21to22 <int> 3, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour22to23 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour23to0 <int> 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Morning <int> 658, 0, 0, 0, 0, 0, 258, ...
## $ Checkins_Afternoon <int> 241, 0, 0, 0, 0, 0, 243, ...
## $ Checkins_Evening <int> 129, 0, 0, 0, 0, 0, 112, ...
## $ Checkins_Night <int> 1, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Past3Months <int> 3160, 3543, 0, 0, 0, 0, 1...
## $ Checkins_Sunday_Past3Months <int> 11, 163, 0, 0, 0, 0, 0, 0...
## $ Checkins_Monday_Past3Months <int> 574, 684, 0, 0, 0, 0, 376...
## $ Checkins_Tuesday_Past3Months <int> 672, 643, 0, 0, 0, 0, 390...
## $ Checkins_Wednesday_Past3Months <int> 621, 758, 0, 0, 0, 0, 423...
## $ Checkins_Thursday_Past3Months <int> 665, 548, 0, 0, 0, 0, 354...
## $ Checkins_Friday_Past3Months <int> 484, 500, 0, 0, 0, 0, 228...
## $ Checkins_Saturday_Past3Months <int> 133, 247, 0, 0, 0, 0, 40,...
## $ Checkins_Weekend_Past3Months <int> 144, 410, 0, 0, 0, 0, 40,...
## $ Checkins_Weekday_Past3Months <int> 3016, 3133, 0, 0, 0, 0, 1...
## $ Checkins_Hour0to1_Past3Months <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour1to2_Past3Months <int> 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Hour2to3_Past3Months <int> 0, 0, 0, 0, 0, 0, 4, 0, 0...
## $ Checkins_Hour3to4_Past3Months <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour4to5_Past3Months <int> 9, 12, 0, 0, 0, 0, 0, 0, ...
## $ Checkins_Hour5to6_Past3Months <int> 127, 85, 0, 0, 0, 0, 94, ...
## $ Checkins_Hour6to7_Past3Months <int> 78, 162, 0, 0, 0, 0, 182,...
## $ Checkins_Hour7to8_Past3Months <int> 164, 198, 0, 0, 0, 0, 158...
## $ Checkins_Hour8to9_Past3Months <int> 194, 275, 0, 0, 0, 0, 146...
## $ Checkins_Hour9to10_Past3Months <int> 657, 248, 0, 0, 0, 0, 110...
## $ Checkins_Hour10to11_Past3Months <int> 395, 248, 0, 0, 0, 0, 98,...
## $ Checkins_Hour11to12_Past3Months <int> 236, 408, 0, 0, 0, 0, 112...
## $ Checkins_Hour12to13_Past3Months <int> 177, 307, 0, 0, 0, 0, 87,...
## $ Checkins_Hour13to14_Past3Months <int> 107, 202, 0, 0, 0, 0, 93,...
## $ Checkins_Hour14to15_Past3Months <int> 83, 134, 0, 0, 0, 0, 80, ...
## $ Checkins_Hour15to16_Past3Months <int> 84, 149, 0, 0, 0, 0, 106,...
## $ Checkins_Hour16to17_Past3Months <int> 142, 257, 0, 0, 0, 0, 132...
## $ Checkins_Hour17to18_Past3Months <int> 334, 237, 0, 0, 0, 0, 135...
## $ Checkins_Hour18to19_Past3Months <int> 161, 238, 0, 0, 0, 0, 189...
## $ Checkins_Hour19to20_Past3Months <int> 132, 160, 0, 0, 0, 0, 76,...
## $ Checkins_Hour20to21_Past3Months <int> 73, 116, 0, 0, 0, 0, 7, 0...
## $ Checkins_Hour21to22_Past3Months <int> 6, 102, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour22to23_Past3Months <int> 0, 4, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour23to0_Past3Months <int> 0, 1, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Morning_Past3Months <int> 1851, 1624, 0, 0, 0, 0, 9...
## $ Checkins_Afternoon_Past3Months <int> 927, 1286, 0, 0, 0, 0, 63...
## $ Checkins_Evening_Past3Months <int> 372, 620, 0, 0, 0, 0, 272...
## $ Checkins_Night_Past3Months <int> 10, 13, 0, 0, 0, 0, 6, 0,...
## $ PurchasedSessions <int> 3414, 8, 484, 24, 0, 471,...
## $ PurchasedSessions_Past3Months <int> 14802, 432, 1731, 24, 46,...
## $ TotalRevenue <dbl> 42708, 240, 19992, 2700, ...
## $ TotalRevenue_Past3Months <dbl> 120245, 36470, 57193, 270...
## $ ActiveUsers <int> 2489, 4396, 0, 0, 0, 0, 7...
## $ ActiveUsers_Last3Months <int> 2519, 4413, 0, 0, 0, 0, 7...
## $ ActiveEmployees <int> 13, 4, 0, 0, 0, 0, 5, 13,...
## $ ActiveEmployees_Last3Months <int> 17, 16, 0, 0, 0, 0, 6, 17...
## $ Classes_Services <int> 12, 109, 0, 9, 9, 0, 22, ...
## $ Classes <int> 5, 66, 0, 1, 1, 0, 13, 5,...
## $ Services <int> 7, 43, 0, 8, 8, 0, 9, 7, ...
Classes & Services
Get Data
#Get Data
Classes_Services <- read_csv("C:/Users/cweaver/Downloads/Classes_Services.csv",
col_types = cols(AvailablePIF = col_integer(),
ByPerson = col_integer(), CanSeeInstructor = col_integer(),
CancellationHrs = col_integer(),
ClassMins = col_integer(), CustSelfBook = col_integer(),
Disabled = col_integer(), EmailReminders = col_integer(),
HalfHour = col_integer(), MultipleInstructors = col_integer(),
MustHaveCredit = col_integer(), MustPreBook = col_integer(),
NumPerClass = col_integer(), OnTheHour = col_integer(),
QuarterPast = col_integer(), QuarterTill = col_integer(),
RescheduleDeadline = col_integer(),
RuleCustomers = col_integer(), RuleFrontDesk = col_integer(),
SMSReminders = col_integer(), ServicesId = col_integer(),
ShowPublic = col_integer(), StandardPrice = col_integer()),
na = "NA")
myData <- Classes_Services
rm(Classes_Services)
Classes_Services.csv returned 34,055 records with 28 variables. Of the 28 variables, 1 is a character type and 27 are numeric.
glimpse(myData)## Observations: 34,055
## Variables: 28
## $ Type <chr> "Class", "Class", "Class", "Class", "Cla...
## $ ClassId <int> 2, 3, 33, 36, 37, 39, 41, 43, 44, 48, 51...
## $ ServicesId <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StoreId <int> 1, 1, 11, 17, 20, 20, 20, 20, 20, 24, 20...
## $ Disabled <int> NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, NA,...
## $ MustPreBook <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, ...
## $ CustSelfBook <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, ...
## $ MustHaveCredit <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ClassMins <int> 45, 60, 45, 120, 60, 60, 60, 30, 30, 60,...
## $ NumPerClass <int> 6, 10, 9, 200, 10, 12, 40, 2, 4, 24, NA,...
## $ StandardPrice <int> 0, 15, NA, NA, NA, NA, NA, NA, 25, 10, 1...
## $ MultipleInstructors <int> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ AvailablePIF <int> NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ EmailReminders <int> 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ SMSReminders <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CanSeeInstructor <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ShowPublic <int> NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ CanDirectlyBookPublic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AllowWaitList <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ ByPerson <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ OnTheHour <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ QuarterPast <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HalfHour <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ QuarterTill <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ RuleFrontDesk <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ RuleCustomers <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CancellationHrs <int> 12, 12, 24, 48, 24, 24, 24, 24, 24, 24, ...
## $ RescheduleDeadline <int> 12, 12, 24, 48, 12, 12, 12, 12, 12, 24, ...
Remove NA
Just as before, replace the NAs with 0s:
myData <- myData %>% mutate_if(is.integer, funs(replace(., is.na(.), 0)))
myData <- myData %>% mutate_if(is.double, as.integer)
glimpse(myData)## Observations: 34,055
## Variables: 28
## $ Type <chr> "Class", "Class", "Class", "Class", "Cla...
## $ ClassId <int> 2, 3, 33, 36, 37, 39, 41, 43, 44, 48, 51...
## $ ServicesId <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ StoreId <int> 1, 1, 11, 17, 20, 20, 20, 20, 20, 24, 20...
## $ Disabled <int> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0...
## $ MustPreBook <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ CustSelfBook <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ MustHaveCredit <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ClassMins <int> 45, 60, 45, 120, 60, 60, 60, 30, 30, 60,...
## $ NumPerClass <int> 6, 10, 9, 200, 10, 12, 40, 2, 4, 24, 0, ...
## $ StandardPrice <int> 0, 15, 0, 0, 0, 0, 0, 0, 25, 10, 150, 0,...
## $ MultipleInstructors <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ AvailablePIF <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmailReminders <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SMSReminders <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CanSeeInstructor <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ShowPublic <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ CanDirectlyBookPublic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AllowWaitList <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ ByPerson <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ OnTheHour <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterPast <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ HalfHour <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterTill <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleFrontDesk <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleCustomers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CancellationHrs <int> 12, 12, 24, 48, 24, 24, 24, 24, 24, 24, ...
## $ RescheduleDeadline <int> 12, 12, 24, 48, 12, 12, 12, 12, 12, 24, ...
That looks much better!
Character to Factors
Change the variable Type from a character to a factor.
myData <- myData %>% mutate_if(is.character, as.factor)
class(myData$Type)## [1] "factor"
Explore Factors
charSummary(myData)## n miss miss% unique top5levels:count
## Type 34055 0 0 2 Class:17465, Service:16590
myData_factor <- myData %>% select_if(is.factor)
for(i in 1:length(myData_factor)){
print(names(myData_factor[i]))
print(table(myData_factor[i]))
}## [1] "Type"
##
## Class Service
## 17465 16590
for(i in 1:length(myData_factor)){
print(ggplot(myData_factor, aes_string(names(myData_factor[i]))) + geom_bar())
}Numerical Data
myNumSum <- numSummary(myData)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missPCT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(missPCT))
head(myNumSum, 20)## Variable_Name n nunique nzeros missPCT miss%
## 1 ClassId 34055 17466 16590 0 0
## 2 ServicesId 34055 11109 17465 0 0
## 3 StoreId 34055 3142 0 0 0
## 4 Disabled 34055 2 27661 0 0
## 5 MustPreBook 34055 2 22355 0 0
## 6 CustSelfBook 34055 2 17846 0 0
## 7 MustHaveCredit 34055 2 25805 0 0
## 8 ClassMins 34055 35 1385 0 0
## 9 NumPerClass 34055 75 22266 0 0
## 10 StandardPrice 34055 193 19681 0 0
## 11 MultipleInstructors 34055 2 26495 0 0
## 12 AvailablePIF 34055 2 24489 0 0
## 13 EmailReminders 34055 2 24428 0 0
## 14 SMSReminders 34055 2 32101 0 0
## 15 CanSeeInstructor 34055 2 23596 0 0
## 16 ShowPublic 34055 2 18127 0 0
## 17 CanDirectlyBookPublic 34055 2 33340 0 0
## 18 AllowWaitList 34055 2 31613 0 0
## 19 ByPerson 34055 2 17466 0 0
## 20 OnTheHour 34055 2 17561 0 0
Good, no missing data!
Variances
myVariance <- as.data.frame(apply(myData[,-c(1)], 2, var))
myVariance <- tibble::rownames_to_column(myVariance)
names(myVariance)[2] <- "Variance"
myVariance <- myVariance %>% mutate(Variance2 = ifelse(Variance == 0, "No", "Yes"))
table(myVariance$Variance2)##
## Yes
## 27
All the variables have a variance > 0.
if(table(myVariance$Variance2)[1] > 0){
filter(myVariance, Variance2 == "No")
VarNames <- myVariance %>% filter(Variance > 0) %>% select(rowname)
myData <- myData %>% select(StoreId, unlist(VarNames))
}Duplicate Records
If duplicates are found, they will be removed.
cat("The number of duplicated rows is", nrow(myData) - nrow(unique(myData)))## The number of duplicated rows is 12
myData[duplicated(myData),]## # A tibble: 12 x 27
## StoreId ClassId ServicesId Disabled MustPreBook CustSelfBook
## <int> <int> <int> <int> <int> <int>
## 1 1507 0 6563 0 0 0
## 2 2785 0 18588 0 0 0
## 3 3167 0 18710 0 0 1
## 4 3167 0 18710 0 0 1
## 5 3950 0 20519 0 0 0
## 6 4096 0 20804 0 0 0
## 7 4097 0 20808 0 0 0
## 8 4098 0 20812 0 0 0
## 9 4099 0 20816 0 0 0
## 10 4010 0 23251 0 0 0
## 11 4127 0 24098 0 0 0
## 12 5897 0 24235 0 0 0
## # ... with 21 more variables: MustHaveCredit <int>, ClassMins <int>,
## # NumPerClass <int>, StandardPrice <int>, MultipleInstructors <int>,
## # AvailablePIF <int>, EmailReminders <int>, SMSReminders <int>,
## # CanSeeInstructor <int>, ShowPublic <int>, CanDirectlyBookPublic <int>,
## # AllowWaitList <int>, ByPerson <int>, OnTheHour <int>,
## # QuarterPast <int>, HalfHour <int>, QuarterTill <int>,
## # RuleFrontDesk <int>, RuleCustomers <int>, CancellationHrs <int>,
## # RescheduleDeadline <int>
if(nrow(myData) - nrow(unique(myData)) > 0){
head(myData[duplicated(myData),])
myData <- myData[!duplicated(myData),]
}Data Glimpse
glimpse(myData)## Observations: 34,055
## Variables: 27
## $ StoreId <int> 1, 1, 11, 17, 20, 20, 20, 20, 20, 24, 20...
## $ ClassId <int> 2, 3, 33, 36, 37, 39, 41, 43, 44, 48, 51...
## $ ServicesId <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Disabled <int> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0...
## $ MustPreBook <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ CustSelfBook <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ MustHaveCredit <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ClassMins <int> 45, 60, 45, 120, 60, 60, 60, 30, 30, 60,...
## $ NumPerClass <int> 6, 10, 9, 200, 10, 12, 40, 2, 4, 24, 0, ...
## $ StandardPrice <int> 0, 15, 0, 0, 0, 0, 0, 0, 25, 10, 150, 0,...
## $ MultipleInstructors <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ AvailablePIF <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmailReminders <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SMSReminders <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CanSeeInstructor <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ShowPublic <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ CanDirectlyBookPublic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AllowWaitList <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ ByPerson <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ OnTheHour <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterPast <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ HalfHour <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterTill <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleFrontDesk <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleCustomers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CancellationHrs <int> 12, 12, 24, 48, 24, 24, 24, 24, 24, 24, ...
## $ RescheduleDeadline <int> 12, 12, 24, 48, 12, 12, 12, 12, 12, 24, ...
Users
Get Data
Users <- read_csv("C:/Users/cweaver/Downloads/Users.csv", col_types = cols(StoreId = col_integer()), progress = FALSE)
myData <- Users
rm(Users)Users.csv returned 17,516,305 records with 48 variables. Of the 48 variables, 2 are character types and 46 are numeric.
glimpse(myData)## Observations: 17,516,305
## Variables: 48
## $ UserId <int> 1, 2, 3, 7, 8, 1059, 1169, 1172, 11...
## $ StoreId <int> NA, NA, 0, 0, NA, 1, 1, 1, 1, 1, 1,...
## $ CheckInCredentialsModified <int> 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Gender <chr> "NULL", "NULL", "M", "M", "NULL", "...
## $ UserType <chr> "CrStaff", "CrStaff", "CrStaff", "D...
## $ Amenities <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tags <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 9, 3, 1,...
## $ Checkins_Sunday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Checkins_Monday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1,...
## $ Checkins_Tuesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ Checkins_Wednesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ Checkins_Thursday <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 5, 0, 0,...
## $ Checkins_Friday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ Checkins_Saturday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour0to1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour1to2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour2to3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour3to4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour4to5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour5to6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour6to7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour7to8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour8to9 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Checkins_Hour9to10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour10to11 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Checkins_Hour11to12 <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 5, 0, 0,...
## $ Checkins_Hour12to13 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Checkins_Hour13to14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ Checkins_Hour14to15 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour15to16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour16to17 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour17to18 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour18to19 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ Checkins_Hour19to20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ Checkins_Hour20to21 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour21to22 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour22to23 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour23to0 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PurchasedSessions <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TotalSpent <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Forms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredForms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredForms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredWaiver <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredHealthHistory <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredWaiver <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredHealthHistory <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Remove NA
There are not as many NAs as we have seen before; this time they are concentrated in the StoreId variable. Also note the NULL values in the Gender field.
myData <- myData %>% mutate_if(is.integer, funs(replace(., is.na(.), 0)))#changes int to dbl
myData[,-41] <- myData %>% mutate_if(is.double, as.integer)#return to int
glimpse(myData$StoreId)## int [1:17516305] 0 0 0 0 0 1 1 1 1 1 ...
StoreId looks better now.
Character to Factors
Gender and UserType are character variables. Change them to factors.
myData <- myData %>% mutate_if(is.character, as.factor)
charSummary(myData)## n miss miss% unique
## Gender 17516304 1 5.71e-06 20
## UserType 17516305 0 0.00e+00 13
## top5levels:count
## Gender F:6181600, M:5638707, NULL:2834500, Female:1527379, Male:1321975
## UserType ClubClient:15676776, DeletedClubClient:1645260, ClubClientTemporary:63819, DeletedClubAdmin:62840, ClubAdmin:55389
myData_factor <- myData %>% select_if(is.factor)Note that Gender has many NULL values. These will be managed later.
Tables for the Factors
Examine the factors in the working dataset.
for(i in 1:length(myData_factor)){
print(names(myData_factor[i]))
print(table(myData_factor[i]))
}## [1] "Gender"
##
## " 1 Erkek f F female Female
## 1 6 2 2326 6181600 3700 1527379
## g Kadin m M male Male Masculino
## 1 1 2349 5638707 2734 1321975 1
## n NULL p u U
## 133 2834500 1 39 849
## [1] "UserType"
##
## CAS ClubAdmin ClubClient
## 5 55389 15676776
## ClubClientTemporary ClubTrainer CorpAdmin
## 63819 823 3256
## CrStaff DeletedClubAdmin DeletedClubClient
## 165 62840 1645260
## DeletedClubTrainer DeletedCorpAdmin DeletedCrStaff
## 2776 2263 164
## Duplicate User
## 2769
rm(myData_factor)There is much work to do on the Gender variable. We will also choose the appropriate UserType values for modeling.
User Factor - Gender
The values in Gender are varied. These will need to be collapsed into a couple of factor levels.
myData %>% group_by(Gender) %>% summarize(Unique_Values = n()) %>% arrange(desc(Unique_Values))## # A tibble: 20 x 2
## Gender Unique_Values
## <fctr> <int>
## 1 F 6181600
## 2 M 5638707
## 3 NULL 2834500
## 4 Female 1527379
## 5 Male 1321975
## 6 female 3700
## 7 male 2734
## 8 m 2349
## 9 f 2326
## 10 U 849
## 11 n 133
## 12 u 39
## 13 1 6
## 14 Erkek 2
## 15 "\"" 1
## 16 g 1
## 17 Kadin 1
## 18 Masculino 1
## 19 p 1
## 20 <NA> 1
ggplot(myData, aes(fct_infreq(Gender))) + geom_bar() + xlab(paste("ClubReady Subset - ", names(myData)[1])) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))levels(myData$Gender)[levels(myData$Gender) == "F"] <- "Female"
levels(myData$Gender)[levels(myData$Gender) == "female"] <- "Female"
levels(myData$Gender)[levels(myData$Gender) == "f"] <- "Female"
levels(myData$Gender)[levels(myData$Gender) == "M"] <- "Male"
levels(myData$Gender)[levels(myData$Gender) == "male"] <- "Male"
levels(myData$Gender)[levels(myData$Gender) == "m"] <- "Male"
myData %>% group_by(Gender) %>% summarize(Unique_Values = n()) %>% arrange(desc(Unique_Values))## # A tibble: 14 x 2
## Gender Unique_Values
## <fctr> <int>
## 1 Female 7715005
## 2 Male 6965765
## 3 NULL 2834500
## 4 U 849
## 5 n 133
## 6 u 39
## 7 1 6
## 8 Erkek 2
## 9 "\"" 1
## 10 g 1
## 11 Kadin 1
## 12 Masculino 1
## 13 p 1
## 14 <NA> 1
This looks better, but there are still suspect values. We will remove:
- All the Gender values with small counts. This affects everything from U and below in the table above.
- The NULL values. It is reasonable to assume Gender is an informative variable in churn modeling, so Valorem will initially remove the NULL values from Gender.
myData <- filter(myData, Gender == "Female" | Gender == "Male")
myData$Gender <- factor(myData$Gender)
ggplot(myData, aes(fct_infreq(Gender))) + geom_bar() + xlab("ClubReady Subset - Gender") +
scale_y_continuous(labels = scales::comma)User Factor - UserType
myData %>% group_by(UserType) %>% summarize(Unique_Values = n()) %>% arrange(desc(Unique_Values))## # A tibble: 13 x 2
## UserType Unique_Values
## <fctr> <int>
## 1 ClubClient 13111028
## 2 DeletedClubClient 1425029
## 3 DeletedClubAdmin 56073
## 4 ClubAdmin 44420
## 5 ClubClientTemporary 32757
## 6 CorpAdmin 3221
## 7 DeletedClubTrainer 2773
## 8 DeletedCorpAdmin 2250
## 9 Duplicate User 2210
## 10 ClubTrainer 819
## 11 CrStaff 106
## 12 DeletedCrStaff 79
## 13 CAS 5
ggplot(myData, aes(fct_infreq(UserType))) + geom_bar() + xlab(paste("ClubReady Subset - ", names(myData)[2])) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(labels = scales::comma)ClubReady confirmed that Valorem should use:
- ClubClient
- DeletedClubClient
- ClubClientTemporary
myData <- filter(myData, UserType == 'ClubClient' | UserType == 'DeletedClubClient' | UserType == 'ClubClientTemporary')
myData$UserType <- factor(myData$UserType)
ggplot(myData, aes(fct_infreq(UserType))) + geom_bar() + xlab("ClubReady Subset - UserType") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(labels = scales::comma)Numerical Data
Because of the large number of User records, a random sample is selected in the code below.
myNumSum <- numSummary(sample_frac(myData, .3))[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missPCT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(missPCT))
head(myNumSum, 20)## Variable_Name n nunique nzeros missPCT miss%
## 1 UserId 4370644 4370644 0 0 0
## 2 StoreId 4370644 3528 2532 0 0
## 3 CheckInCredentialsModified 4370644 2 480413 0 0
## 4 Amenities 4370644 41 4221592 0 0
## 5 Tags 4370644 7 4349532 0 0
## 6 Checkins 4370644 1963 3218561 0 0
## 7 Checkins_Sunday 4370644 386 3922237 0 0
## 8 Checkins_Monday 4370644 495 3497281 0 0
## 9 Checkins_Tuesday 4370644 495 3499396 0 0
## 10 Checkins_Wednesday 4370644 490 3501510 0 0
## 11 Checkins_Thursday 4370644 463 3531767 0 0
## 12 Checkins_Friday 4370644 449 3614680 0 0
## 13 Checkins_Saturday 4370644 412 3756634 0 0
## 14 Checkins_Hour0to1 4370644 523 4335798 0 0
## 15 Checkins_Hour1to2 4370644 131 4358393 0 0
## 16 Checkins_Hour2to3 4370644 135 4361939 0 0
## 17 Checkins_Hour3to4 4370644 219 4360650 0 0
## 18 Checkins_Hour4to5 4370644 509 4334624 0 0
## 19 Checkins_Hour5to6 4370644 772 4235945 0 0
## 20 Checkins_Hour6to7 4370644 652 4185863 0 0
No missing data is in the working dataset.
Variance
#Do not include UserId, StoreId, Gender, UserType, TotalSpent
myVariance <- as.data.frame(apply(myData[,-c(1,2,4,5,41)], 2, var))
myVariance <- tibble::rownames_to_column(myVariance)
names(myVariance)[2] <- "Variance"
myVariance <- myVariance %>% mutate(Variance2 = ifelse(Variance == 0, "No", "Yes"))
table(myVariance$Variance2)##
## No Yes
## 1 42
Because 1 variable has no variance (all its values are the same), it can be removed from the working dataset. If there are no differences in a column, it is of no use in the development of an algorithm. The variable to be removed because it has no variance is:
if(table(myVariance$Variance2)[1] > 0){
VarNames <- myVariance %>% filter(Variance > 0) %>% select(rowname)
zeroVarNames <- myVariance %>% filter(Variance == 0) %>% select(rowname)
myData <- myData %>% select(UserId, StoreId, Gender, TotalSpent, unlist(VarNames))
zeroVarNames
}## rowname
## 1 RequiredWaiver
Outlier Detection
In the working dataset, there is one variable, TotalSpent, that should be evaluated to identify potential outliers. There are many ways to visualize outliers. While boxplots are the most commonly used visualization, the number of records here is large, so plotting is not an optimal reporting option - it takes a long time to plot millions of records.
Comparing the opposite ends of TotalSpent produces interesting information:
tmpRevDesc <- arrange(myData, desc(TotalSpent)) %>% select(TotalSpent)
tmpRevDesc <- tmpRevDesc[1:25,]
tmpRevDesc <- as.data.frame(scales::dollar(tmpRevDesc$TotalSpent))
names(tmpRevDesc) <- "Total_Spent"
tmpRevAsc <- arrange(myData, TotalSpent) %>% select(TotalSpent)
tmpRevAsc <- tmpRevAsc[1:25,]
tmpRevAsc <- as.data.frame(scales::dollar(tmpRevAsc$TotalSpent))
names(tmpRevAsc) <- "Total_Spent"
knitr::kable(list(tmpRevDesc, tmpRevAsc))
The highest TotalSpent value is the first entry in the descending table above, and the lowest value is -$638,972.
It appears the large values may be associated with account credits. This may be a data quality issue.
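When plotting millions of points is impractical, quantiles give a fast numeric view of the tails. A minimal sketch, using simulated values standing in for TotalSpent (the data here is made up for illustration):

```r
set.seed(42)
# Simulated stand-in for TotalSpent: mostly modest values plus extreme tails
total_spent <- c(rnorm(10000, mean = 100, sd = 50), -638972, 250000)

# Quantiles summarize the distribution without plotting every record
quantile(total_spent, probs = c(0, 0.001, 0.25, 0.5, 0.75, 0.999, 1))

# The usual 1.5 * IQR rule then flags candidate outliers numerically
q   <- quantile(total_spent, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers <- total_spent[total_spent < q[1] - 1.5 * iqr |
                        total_spent > q[2] + 1.5 * iqr]
```

On the real data, the flagged values would then be reviewed with ClubReady to decide whether they are credits, data quality issues, or genuine spend.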
Duplication
cat("The number of duplicated rows is", nrow(myData) - nrow(unique(myData)))## The number of duplicated rows is 0
if((nrow(myData) - nrow(unique(myData)))>0) myData[duplicated(myData),]if(nrow(myData) - nrow(unique(myData)) > 0){
head(myData[duplicated(myData),])
myData <- myData[!duplicated(myData),]
}Good news - no duplicate records.
Data Glimpse
Here is what the resulting working dataset looks like. We are left with 14,568,814 records and 46 variables.
glimpse(myData)## Observations: 14,568,814
## Variables: 46
## $ UserId <int> 1169, 1172, 1178, 1187, 1188, 1191,...
## $ StoreId <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Gender <fctr> Male, Male, Female, Male, Male, Ma...
## $ TotalSpent <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CheckInCredentialsModified <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Amenities <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tags <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins <int> 1, 0, 0, 9, 3, 1, 0, 1, 0, 0, 0, 0,...
## $ Checkins_Sunday <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Monday <int> 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Tuesday <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Wednesday <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ Checkins_Thursday <int> 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Friday <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Saturday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour0to1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour1to2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour2to3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour3to4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour4to5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour5to6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour6to7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour7to8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour8to9 <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour9to10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour10to11 <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour11to12 <int> 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour12to13 <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour13to14 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour14to15 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour15to16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour16to17 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour17to18 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour18to19 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour19to20 <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ Checkins_Hour20to21 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour21to22 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour22to23 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour23to0 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PurchasedSessions <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Forms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredForms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredForms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredWaiver <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredHealthHistory <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredHealthHistory <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Initial Data Modeling
This section walks through the Initial Data Modeling phase to identify candidate algorithms to predict different churn models for ClubReady. The section is organized as follows:
- One of two CSV data files for modelling, stores.csv, is explored in detail in the first section
stores.csv, is explored with detail in the first section - In the following sections, the other data file (Users) is explored albeit with less explanatory content
- All of these sections roughly follow the same thought process:
- Develop modeling data structure
- Create modeling data file and import to Azure Machine Learning Studio
- Perform Test/Training Data Split
- Determine “Optimal” Initial Model
All of the SQL used to develop these datasets can be found in the [ML].[ChurnScripts] stored procedure within the [Reports] database.
Stores
Develop Modeling Data Structure
What is Churn?
During the On-Site Workshop, ClubReady provided a basic definition of Store Churn. If a Store recognizes at least an 80% drop in Revenue over the course of a single month, then the Store is considered to be Churned.
We then need to answer What is Revenue? ClubReady provided the definition that Revenue is the total dollars invoiced to the Store by ClubReady. This information is available in the [PurchaseLog] table within the [ClubReady] database. Within this table, we utilize [StoreId], [PaymentMade] and [PurchaseAmount] columns to determine when the store has churned.
A single store cannot churn more than once within our time frame - a two-year time frame starting in November 2015 and ending in October 2017.
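The 80% month-over-month rule above can be sketched in R. This is an illustrative sketch only: the column names (StoreId, Month, Revenue) and the toy data are assumptions, not ClubReady's actual [PurchaseLog] schema.

```r
library(dplyr)

# Toy monthly revenue for one store (illustrative values only)
monthly_rev <- tibble::tibble(
  StoreId = c(1, 1, 1, 1),
  Month   = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01")),
  Revenue = c(1000, 950, 150, 0)
)

churn_flags <- monthly_rev %>%
  arrange(StoreId, Month) %>%
  group_by(StoreId) %>%
  mutate(PrevRevenue = lag(Revenue),
         # Churn = at least an 80% drop in revenue versus the prior month
         IsChurn = !is.na(PrevRevenue) & PrevRevenue > 0 &
           (PrevRevenue - Revenue) / PrevRevenue >= 0.8) %>%
  # A store can churn at most once in the window: keep only the first flag
  mutate(IsChurn = IsChurn & cumsum(IsChurn) == 1) %>%
  ungroup()
```

In this toy series only March is flagged: February to March is an 84% drop, and the later 100% drop is suppressed because the store has already churned.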
Dataset Granularity
The next important consideration is the granularity of the modelling dataset. Granularity defines what each row in the dataset represents. In general, the granularity of the modelling dataset should match the data as it is intended to be used in production. During the On-Site Workshop, this was discussed. The intention is to receive some type of notification that a store is likely to churn within the next few months. Therefore, a dataset where each row describes a single store for a single month and whether that store churned within that month or within the next few months is required. This allows ClubReady to run a monthly reporting or notification system that alerts them which stores are at risk of churning.
Create Modeling Data File
Using the information from the On-Site Workshop combined with what we learned in the Exploratory Data Analysis phase, the modeling dataset is created. The following subject areas within the ClubReady database were identified:
- Store Info (Status, Age)
- SMS
- Amenities
- Integrations
- Forms
- Checkins
- Purchased Sessions
- Revenue
- Active Users
- Active Employees
- Classes
- Services
The goal is to create a monthly “snapshot” of each store showing information about that store at that point in time. However, not all of these subject areas can be queried historically. This means there is no option but to take the data as it is today, assuming it has not changed over time.
The following subject areas do not track historic information:
- Store info (Status)
- SMS
- Amenities
- Integrations
- Forms
- Classes
- Services
This leaves the following subject areas with historic information:
- Store Info (Age)
- Checkins
- Purchased Sessions
- Revenue
- Active Users
- Active Employees
To empower the models using time intelligence, metrics for these subject areas over the past three months were included. For instance, there are variables for [TotalRevenue_Past3Months] and [ActiveMembers_Past3Months].
This was all combined into a single dataset. The queries to create the Stores.csv file can be found in the [ML].[ClubChurn] and [ML].[CompleteStoreData] stored procedures in the [Reports] database.
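The Past3Months variables can be sketched as a trailing window in R. The SQL in the stored procedures is authoritative; this dplyr version and its column names are illustrative assumptions.

```r
library(dplyr)

# Toy per-store monthly snapshot (illustrative values only)
store_months <- tibble::tibble(
  StoreId      = 1,
  Month        = as.Date(c("2017-01-01", "2017-02-01",
                           "2017-03-01", "2017-04-01")),
  TotalRevenue = c(100, 200, 300, 400)
)

store_months <- store_months %>%
  arrange(StoreId, Month) %>%
  group_by(StoreId) %>%
  # Trailing three-month sum: current month plus the two prior months
  mutate(TotalRevenue_Past3Months =
           TotalRevenue +
           lag(TotalRevenue, 1, default = 0) +
           lag(TotalRevenue, 2, default = 0)) %>%
  ungroup()
```

For the toy data above, TotalRevenue_Past3Months works out to 100, 300, 600 and 900.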
Perform Test/Training Data Split
Similar to the “Modeling Data Structure” phase, the training and testing data should represent a production use case as closely as possible. ClubReady wants to be able to identify whether a store is likely to churn in the next few months. In this instance, three months was selected as the time interval. Since the last month in the dataset is October 2017, the testing month should be August 2017. The records for August 2017 contain information about the month of August, as well as the previous two months for the Past3Months variables. These records also contain a variable [IsChurn_3Months] describing whether the store will churn in the next three months. This is what the model attempts to predict.
In machine learning, it is important the model be tested using data that was not used to train it. For instance, if a store churns in August 2017, then that will be reflected in the [IsChurn_3Months] variable for the months of June and July as well.
The following training/testing split was developed:
- Training Set 1
- Start Month: January 2016
- End Month: May 2017
- Testing Set 1
- Start Month: August 2017
- End Month: August 2017
- Variable to Predict:
IsChurn_3Months
For additional testing, other splits were created:
- Training Set 2
- Start Month: January 2016
- End Month: April 2017
- Testing Set 2
- Start Month: July 2017
- End Month: July 2017
- Training Set 3
- Start Month: January 2016
- End Month: March 2017
- Testing Set 3
- Start Month: June 2017
- End Month: June 2017
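Each split above amounts to filtering the monthly snapshot table by date, leaving a two-month gap between the training end month and the test month so the training labels cannot overlap the test period. A minimal sketch, assuming a Month column (the table and column names are illustrative):

```r
library(dplyr)

# Toy snapshot table: one row per store per month (illustrative)
storedata <- tibble::tibble(
  StoreId = 1,
  Month   = seq(as.Date("2016-01-01"), as.Date("2017-10-01"), by = "month")
)

make_split <- function(data, train_end, test_month) {
  list(
    train = filter(data, Month >= as.Date("2016-01-01"), Month <= train_end),
    test  = filter(data, Month == test_month)
  )
}

# Training Set 1 (Jan 2016 - May 2017) / Testing Set 1 (Aug 2017)
split1 <- make_split(storedata, as.Date("2017-05-01"), as.Date("2017-08-01"))
```

The other splits shift `train_end` and `test_month` back one month at a time.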
Determine “Optimal” Initial Model
The Initial Modeling phase has two goals.
Determine how much of a “pattern” is easily accessible within the data. For instance, if the best model created in this phase falls substantially below what we would consider “reasonable”, then it may be decided that more or different data is required.
If patterns exist, then use the information to point towards the optimal way to improve the model using feature engineering or additional model optimization.
Selecting an Evaluation Metric
The selection of an evaluation metric is one of the most crucial aspects of the data modelling process. Recognizing that we are in a Binary Classification situation, we are limited to two main contenders, Accuracy/AUC and Precision/Recall. In general, Accuracy/AUC are acceptable when the two classes, “Churn” and “Not Churn” in this case, are approximately equally likely. However, in cases where one class is substantially more likely than another, Precision/Recall is the standard.
finalstoredata <- read_csv("C:/Users/cweaver/Downloads/Stores_2017_12_11.csv", col_types = cols())
churn <- finalstoredata["IsChurn_3Months"] %>%
group_by(IsChurn_3Months) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(churn, aes(x = factor(IsChurn_3Months), y = perc*100)) + geom_bar(stat="identity") + ggtitle("IsChurn_3Months") + xlab("IsChurn_3Months") + ylab("Percentage") + geom_text(aes(x = factor(IsChurn_3Months), y = perc*100, label = paste(round(perc*100,0),"%")), nudge_y = 4)Since the [IsChurn_3Months] variable is heavily imbalanced, Precision/Recall will be used as the performance metric. Precision is the percentage of correct Churn predictions out of all Churn predictions. Conversely, Recall is the percentage of correct Churn predictions out of all actual Churn records. Recall tells how many of the actual Churns are being predicted, while Precision shows how many of the predictions are actually Churns. These metrics tell different stories and both are important. Maximizing both is ideal; therefore, an evaluation metric of Precision * Recall is used.
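Precision, Recall, and their product can be computed directly from label vectors. A minimal sketch with made-up 0/1 labels:

```r
# Precision * Recall from actual vs. predicted churn labels (0/1)
precision_recall <- function(actual, predicted) {
  tp        <- sum(predicted == 1 & actual == 1)  # true positives
  precision <- tp / sum(predicted == 1)  # correct churn calls / all churn calls
  recall    <- tp / sum(actual == 1)     # correct churn calls / all actual churns
  c(precision = precision, recall = recall, metric = precision * recall)
}

actual    <- c(1, 1, 1, 0, 0, 0, 0, 0)  # illustrative labels only
predicted <- c(1, 1, 0, 1, 0, 0, 0, 0)
precision_recall(actual, predicted)
# precision and recall are both 2/3, so the combined metric is 4/9
```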
The “Kitchen Sink” Approach
Azure Machine Learning Studio contains fifteen distinct Binary Classification algorithms. Fourteen of these can utilize a module called “Tune Model Hyper-parameters”. This module will allow us to train and test multiple machine learning algorithms using different sets of hyper-parameters. Therefore, using all fourteen modules, it is possible to get insights quickly and easily. The process looks like this:
The full results of this process can be found in the Store Churn Data Modelling Results.xlsx file. Here are the top three models for predicting Store Churn:
| Model Family | Hyper-parameters | Precision | Recall |
|---|---|---|---|
| Boosted Decision Tree | Leaves: 54 ~ Minimum Instances: 19 ~ Learning Rate: 0.336396 ~ Trees: 51 | 89.3% | 38.4% |
| Boosted Decision Tree | Leaves: 5 ~ Minimum Instances: 37 ~ Learning Rate: 0.030064 ~ Trees: 362 | 100.0% | 30.8% |
| Boosted Decision Tree | Leaves: 6 ~ Minimum Instances: 15 ~ Learning Rate: 0.356604 ~ Trees: 482 | 95.2% | 30.8% |
These results are promising. There is evidence of a pattern within the data that may be used to predict when a Store is likely to churn.
Users
Develop Modeling Data Structure
What is Churn?
During the On-Site Workshop, ClubReady provided a definition of User Churn. If a User’s contract expires and is not renewed by the end of the next month, then that User has churned. This information is available in the [ContractPurchases] table within the [ClubReady] database. Within this table, [UserId], [ActivationDateUTC], [AgreedDate] and [Cancelled] columns are used.
Similar to Store Churn, a single user cannot churn more than once within the time frame.
Dataset Granularity
The granularity of the User Churn data is per user, per month, for every month that the user has an active contract, plus an additional record for the month after their contract ends. This is the “Churn” month.
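That granularity can be sketched by expanding each contract into monthly rows plus one trailing churn month. The contract columns below are illustrative, not the actual [ContractPurchases] schema:

```r
library(dplyr)
library(tidyr)

# One toy contract: active January through March 2017 (illustrative)
contracts <- tibble::tibble(
  UserId        = 1,
  ContractStart = as.Date("2017-01-01"),
  ContractEnd   = as.Date("2017-03-01")
)

user_months <- contracts %>%
  rowwise() %>%
  mutate(Month = list(seq(ContractStart,
                          # one month past the contract end: the "churn" month
                          seq(ContractEnd, by = "month", length.out = 2)[2],
                          by = "month"))) %>%
  ungroup() %>%
  unnest(cols = Month) %>%
  mutate(IsChurnMonth = Month > ContractEnd)
```

This yields four rows for the toy contract: January, February and March (active) and April (the churn month).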
Create Modeling Data File
Using the information from the On-Site Workshop combined with what was learned in the Exploratory Data Analysis phase, a User modeling dataset can be developed. The following subject areas within the ClubReady database were identified:
- Store Churn Data
- User Info (Gender, Type, CheckinCredentialsModified)
- Amenities
- Tags
- Checkins
- Forms
- Purchased Sessions
- Spend
The goal is to create a monthly “snapshot” of each user showing information about that user at that point in time. However, not all of these subject areas can be queried historically.
The following subject areas do not track historic information:
- Store Churn Data (see previous section)
- User Info (Gender, Type, CheckinCredentialsModified)
- Amenities
- Tags
- Forms
This leaves the following subject areas with historic information:
- Store Churn Data (see previous section)
- Checkins
- Purchased Sessions
- Spend
To empower the models using time intelligence, additional variables for [TotalSpend_Past3Months] and [Checkins_Past3Months] were created.
All of this information was combined into a single dataset. The queries to create the User.csv file can be found in the [ML].[MemberChurn] and [ML].[CompleteUserData] stored procedures in the [Reports] database.
Perform Test/Training Data Split
The logic for the test/training split is identical to that of “Store Churn”.
- Training Set 1
- Start Month: January 2016
- End Month: May 2017
- Testing Set 1
- Start Month: August 2017
- End Month: August 2017
- Variable to Predict:
IsChurn_3Months
For additional testing, we can also move backwards any number of months to create new training/testing splits as follows:
- Training Set 2
- Start Month: January 2016
- End Month: April 2017
- Testing Set 2
- Start Month: July 2017
- End Month: July 2017
- Training Set 3
- Start Month: January 2016
- End Month: March 2017
- Testing Set 3
- Start Month: June 2017
- End Month: June 2017
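The backward-shifted splits above can be generated programmatically. A sketch, where `shift_months` and `make_split` are hypothetical helper names:

```r
# Shift a month-start date by k months (k may be negative).
shift_months <- function(d, k) {
  if (k == 0) return(d)
  seq(d, by = paste(k, "months"), length.out = 2)[2]
}

# n_back = 0 reproduces Set 1; n_back = 1 gives Set 2; n_back = 2 gives Set 3.
make_split <- function(n_back = 0) {
  list(
    train_start = as.Date("2016-01-01"),
    train_end   = shift_months(as.Date("2017-05-01"), -n_back),
    test_month  = shift_months(as.Date("2017-08-01"), -n_back)
  )
}

s <- make_split(2)  # Set 3: train through March 2017, test on June 2017
```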
Determine “Optimal” Initial Model
Selecting an Evaluation Metric
A major difference between the Store and User data is the overall size of the data. The [FinalUserData] table is 24 GB as a CSV, so randomly selected subsets of 1M, 5M, 10M and 20M rows were created.
finaluserdata <- read_csv("C:/Users/cweaver/Downloads/Users_1M_2017_12_12.csv", col_types = cols(
Store_UserEmployeeRatio = col_double(),
Store_UserEmployeeRatio_Last3Months = col_double(),
Store_AverageMinsPerClass_Service = col_double(),
Store_AveragePricePerClass_Service = col_double(),
Store_AverageCancellationHrsPerClass_Service = col_double(),
Store_AverageRescheduleDeadlinePerClass_Service = col_double(),
Store_AverageMinsPerClass = col_double(),
Store_AveragePricePerClass = col_double(),
Store_AverageCancellationHrsPerClass = col_double(),
Store_AverageRescheduleDeadlinePerClass = col_double(),
Store_AverageMinsPerService = col_double(),
Store_AveragePricePerService = col_double(),
Store_AverageCancellationHrsPerService = col_double(),
Store_AverageRescheduleDeadlinePerService = col_double())
,progress = FALSE
)
churn <- finaluserdata["IsChurn_3Months"] %>% group_by(IsChurn_3Months) %>% summarise(count=n()) %>% mutate(perc=count/sum(count))
ggplot(churn, aes(x = factor(IsChurn_3Months), y = perc*100)) + geom_bar(stat="identity") + ggtitle("IsChurn_3Months") + xlab("IsChurn_3Months") + ylab("Percentage") + geom_text(aes(x = factor(IsChurn_3Months), y = perc*100, label = paste(round(perc*100,0),"%")), nudge_y = 4)
The “Kitchen Sink” Approach
The full results of this process can be found in the User Churn Data Modelling Results.xlsx file. Here are the top three models for predicting User Churn:
| Model Family | Hyper-parameters | Precision | Recall |
|---|---|---|---|
| Boosted Decision Tree | Leaves: 32 ~ Minimum Instances: 6 ~ Learning Rate: 0.252098531 ~ Trees: 270 | 57.1% | 16.5% |
| Boosted Decision Tree | Leaves: 54 ~ Minimum Instances: 19 ~ Learning Rate: 0.336396247 ~ Trees: 51 | 55.0% | 15.9% |
| Boosted Decision Tree | Leaves: 17 ~ Minimum Instances: 13 ~ Learning Rate: 0.06286619 ~ Trees: 50 | 54.5% | 12.6% |
The Precision and Recall of these models are quite low. Other methods are required to develop a feasible model - if the data allows an algorithmic solution at all.
Further Investigation
The datasets used to develop the initial models contained 220 predictors and either 1, 5 or 10 million records, and each dataset produced similar results. Therefore, it may be worth considering ways to reduce the number of predictors in the model. It is possible that 220 predictors, many consisting largely of zeros, are simply overwhelming the modeling algorithms, leading to poor models. Reducing the number of predictors without losing much information may improve algorithm performance. A common technique for this is Principal Components Analysis (PCA). Before PCA can be performed, the dataset must be cleaned by removing non-predictive columns.
ignore.set <- c("UserId", "StoreId", "ReferenceMonthStartDate","IsChurn", "IsChurn_2Months", "IsChurn_3Months", "IsChurn_6Months")
disp <- data.frame(ignore.set)
names(disp) <- "Variables Ignored for Business Reasons"
disp
## Variables Ignored for Business Reasons
## 1 UserId
## 2 StoreId
## 3 ReferenceMonthStartDate
## 4 IsChurn
## 5 IsChurn_2Months
## 6 IsChurn_3Months
## 7 IsChurn_6Months
Next, remove any variables with missing values, as PCA does not allow for these.
finaluserdata.miss <- finaluserdata[,-which(names(finaluserdata) %in% ignore.set)]
myNumSum <- numSummary(finaluserdata.miss)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missCNT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(n))
ignore.miss <- myNumSum[myNumSum[,5] > 0,1]
disp <- data.frame(ignore.miss)
names(disp) <- c("Variables with Missing Values")
disp
## Variables with Missing Values
## 1 Store_UserEmployeeRatio_Last3Months
## 2 Store_UserEmployeeRatio
## 3 Store_AverageMinsPerClass_Service
## 4 Store_AveragePricePerClass_Service
## 5 Store_AverageCancellationHrsPerClass_Service
## 6 Store_AverageRescheduleDeadlinePerClass_Service
## 7 Store_AverageCancellationHrsPerService
## 8 Store_AverageRescheduleDeadlinePerService
## 9 Store_AveragePricePerService
## 10 Store_AverageMinsPerService
## 11 Store_AverageMinsPerClass
## 12 Store_AveragePricePerClass
## 13 Store_AverageCancellationHrsPerClass
## 14 Store_AverageRescheduleDeadlinePerClass
Remove variables that have no variance, as these are not useful for predictive modeling.
finaluserdata.var <- finaluserdata.miss[,-which(names(finaluserdata.miss) %in% ignore.miss)]
ignore.var <- names(which(round(apply(finaluserdata.var, MARGIN = 2, FUN = var), 4)<= 0))
disp <- data.frame(ignore.var)
names(disp) <- "Variables with No Variance"
disp
## Variables with No Variance
## 1 NonRequiredForms
## 2 NonRequiredWaiver
## 3 NonRequiredHealthHistory
## 4 RequiredHealthHistory
## 5 Store_SMSEnabled
## 6 Store_Integration_FitBPO
## 7 Store_Integration_MotionSoft
## 8 Store_Integration_Paramount
## 9 Store_Integration_ShareYourFit
## 10 Store_Integration_VisualFitnessPlanner
## 11 Store_IntegrationType_LandingPages_Marketing
## 12 Store_IntegrationType_SocialMedia
Perform One-Hot Processing to change the remaining categorical variables into numeric ones, a PCA requirement.
The output from this code is excessive so the results are not shown.
finaluserdata.enc <- finaluserdata.var[,-which(names(finaluserdata.var) %in% ignore.var)]
ignore.char <- row.names(charSummary(finaluserdata.enc))
encoder <- onehot(finaluserdata.enc[,which(names(finaluserdata.enc) %in% ignore.char)], stringsAsFactors = TRUE)
dat.encoded <- predict(encoder, finaluserdata.enc)
finaluserdata.encoded <- data.frame(finaluserdata.enc[,-which(names(finaluserdata.enc) %in% ignore.char)], dat.encoded)
head(finaluserdata.encoded)
Perform PCA to reduce the data to a more manageable number of variables. Start by looking at the percentage of total information contained within each of the new variables, known as Principal Components.
The results of the first 13 ordered descending records are shown.
pca <- princomp(finaluserdata.encoded)
info <- pca$sdev / sum(pca$sdev)
disp <- data.frame(paste(round(info*100,1), "%"))
names(disp) <- "Information Percentage by Component"
row.names(disp) <- names(info)
head(disp, 15)
## Information Percentage by Component
## Comp.1 89.8 %
## Comp.2 5.1 %
## Comp.3 1.9 %
## Comp.4 1.8 %
## Comp.5 0.4 %
## Comp.6 0.2 %
## Comp.7 0.1 %
## Comp.8 0.1 %
## Comp.9 0.1 %
## Comp.10 0.1 %
## Comp.11 0 %
## Comp.12 0 %
## Comp.13 0 %
## Comp.14 0 %
## Comp.15 0 %
This highlights a significant reason why the User Churn models were not very effective. The 199 predictors evaluated by PCA could be nearly entirely replaced by a few dense variables. In other words, the data did not contain very much “information”.
info.cum <- cumsum(info)
disp <- data.frame(paste(round(info.cum*100,1), "%"))
names(disp) <- c("Cumulative Information Percentage by Component")
rownames(disp) <- names(info.cum)
head(disp, 15)
## Cumulative Information Percentage by Component
## Comp.1 89.8 %
## Comp.2 94.9 %
## Comp.3 96.8 %
## Comp.4 98.6 %
## Comp.5 99 %
## Comp.6 99.2 %
## Comp.7 99.3 %
## Comp.8 99.4 %
## Comp.9 99.5 %
## Comp.10 99.6 %
## Comp.11 99.6 %
## Comp.12 99.6 %
## Comp.13 99.6 %
## Comp.14 99.6 %
## Comp.15 99.6 %
It appears only the first five components are needed. Use this information in Azure Machine Learning Studio to rerun the Initial Data Modeling step using the new dataset containing the five principal components, as well as the columns containing the missing data, as they could still provide value. Here are the results:
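A sketch of how the reduced dataset is produced: score the cleaned matrix on the first five principal components and carry those scores forward as predictors. The synthetic matrix below is a stand-in for the actual encoded data, which was scored in Azure Machine Learning Studio.

```r
# Toy data standing in for finaluserdata.encoded (100 rows, 8 numeric columns).
set.seed(1)
X <- as.data.frame(matrix(rnorm(100 * 8), ncol = 8))

# Fit PCA and keep the scores on the first five components as the new,
# dense replacement predictors.
pca <- princomp(X)
reduced <- data.frame(pca$scores[, 1:5])
```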
Original Dataset (220 Predictors)
| Model Family | Hyper-parameters | Precision | Recall |
|---|---|---|---|
| Boosted Decision Tree | Leaves: 32 ~ Minimum Instances: 6 ~ Learning Rate: 0.252098531 ~ Trees: 270 | 57.1% | 16.5% |
| Boosted Decision Tree | Leaves: 54 ~ Minimum Instances: 19 ~ Learning Rate: 0.336396247 ~ Trees: 51 | 55.0% | 15.9% |
| Boosted Decision Tree | Leaves: 17 ~ Minimum Instances: 13 ~ Learning Rate: 0.06286619 ~ Trees: 50 | 54.5% | 12.6% |
PCA Dataset (12 Predictors + 5 Principal Components)
| Model Family | Hyper-parameters | Precision | Recall |
|---|---|---|---|
| Neural Network - Gaussian Normalizer | Learning Rate: 0.034618 ~ Loss Function: Cross Entropy ~ Iterations: 29 | 15.0% | 2.0% |
| Neural Network - Gaussian Normalizer | Learning Rate: 0.030355 ~ Loss Function: Cross Entropy ~ Iterations: 27 | 18.2% | 1.5% |
| Neural Network - Binning Normalizer | Learning Rate: 0.037861 ~ Loss Function: Cross Entropy ~ Iterations: 129 | 20.3% | 1.1% |
Using PCA on this dataset is not beneficial.
Feature Selection
This section walks through the Feature Selection phase to determine which variables are important to the models we identified in the Initial Data Modeling phase.
- One of the two models, Stores, is examined in the first section.
- In the following sections, the other model, Users, is explored, albeit with less explanatory content.
Stores
It is known there are patterns in the Stores data to predict whether a Store will churn in the next three months. One concern with the current models is that they leverage over 100 variables. This could be troublesome to operationalize. Therefore, the next step is to identify which of the variables are important to the model. Since a model has been created, a technique known as permutation feature importance may be leveraged.
The Permutation Feature Importance module in Azure Machine Learning does not support the custom Precision * Recall metric. Therefore, run the module twice, once for Precision and once for Recall. Then determine which variables do not provide value for Precision or Recall. The results can be found in the Store Churn Data Modeling Results.xlsx file. Here is a summary:
| Variable Name | Precision PFI | Recall PFI |
|---|---|---|
| TotalRevenue | 0.754623 | 0.153846 |
| TotalRevenue_Past3Months | 0.474503 | 0 |
| Status | 0.142292 | 0.061538 |
| Checkins_Evening | 0.100334 | 0 |
| Checkins_Hour6to7_Past3Months | 0.036232 | 0 |
| Checkins_Hour11to12_Past3Months | 0.036232 | 0 |
| ActiveUsers_Last3Months | 0.036232 | 0 |
| AverageMinsPerClass_Service | 0.036232 | 0 |
| Services | 0.036232 | 0 |
| Checkins_Hour8to9 | 0.023411 | 0 |
| Integration_Listen360 | 0.012422 | 0.030769 |
| Forms | 0.012422 | 0.030769 |
| Checkins_Monday | 0.005929 | 0.015385 |
| Checkins_Weekday | 0.005929 | 0.015385 |
| Checkins_Hour14to15 | 0.005929 | 0.015385 |
| Checkins_Morning | 0.005929 | 0.015385 |
| Checkins_Afternoon | 0.005929 | 0.015385 |
| Checkins_Monday_Past3Months | 0.005929 | 0.015385 |
| Checkins_Hour12to13_Past3Months | 0.005929 | 0.015385 |
| Checkins_Evening_Past3Months | 0.005929 | 0.015385 |
| ActiveEmployees_Last3Months | 0.005929 | 0.015385 |
| UserEmployeeRatio_Last3Months | 0.005929 | 0.015385 |
| ActiveEmployees | 0 | 0.015385 |
Any variables not included did not show any importance for either metric. Zeroes are not actually zeroes - they represent very small numbers. When the variable set is reduced to only these variables, metric performance drops, as seen below.
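For reference, permutation feature importance can be sketched in a few lines of base R: shuffle one predictor at a time and record how much a chosen metric degrades. The toy model and metric below are illustrative only, not the Azure Machine Learning module.

```r
# Permutation feature importance: the drop in metric after shuffling each column.
pfi <- function(predictors, metric_fn) {
  base <- metric_fn(predictors)
  sapply(names(predictors), function(v) {
    shuffled <- predictors
    shuffled[[v]] <- sample(shuffled[[v]])
    base - metric_fn(shuffled)     # large positive drop = important variable
  })
}

# Toy data: y depends strongly on x and not at all on noise.
set.seed(7)
toy <- data.frame(x = rnorm(200), noise = rnorm(200))
toy$y <- toy$x + rnorm(200, sd = 0.1)

fit <- lm(y ~ x + noise, data = toy)
metric <- function(d) -mean((predict(fit, d) - toy$y)^2)  # negative MSE

imp <- pfi(toy[c("x", "noise")], metric)
# imp["x"] is large; imp["noise"] is near zero.
```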
Full Variable Set
| Test/Training Set | Precision | Recall | Precision * Recall |
|---|---|---|---|
| Set 1 | 0.893 | 0.384 | 0.343 |
Precision/Recall Variable Set
| Test/Training Set | Precision | Recall | Precision * Recall |
|---|---|---|---|
| Set 1 | 0.815 | 0.336 | 0.27384 |
| Set 2 | 0.767 | 0.287 | 0.220129 |
| Set 3 | 0.84 | 0.309 | 0.25956 |
Another analysis was performed including other evaluation metrics, Accuracy and Average Log Loss.
| Variable Name | Precision PFI | Recall PFI | Accuracy PFI | Average Log Loss PFI |
|---|---|---|---|---|
| TotalRevenue | 0.754623 | 0.153846 | 0.041257 | 0.345677 |
| TotalRevenue_Past3Months | 0.474503 | 0 | 0.016699 | 0.032497 |
| Status | 0.142292 | 0.061538 | 0.003438 | 0.08103 |
| Checkins_Evening | 0.100334 | 0 | 0.001473 | 0 |
| Checkins_Hour6to7_Past3Months | 0.036232 | 0 | 0.000491 | 0 |
| Checkins_Hour11to12_Past3Months | 0.036232 | 0 | 0.000491 | 0 |
| ActiveUsers_Last3Months | 0.036232 | 0 | 0.000491 | 0.007357 |
| AverageMinsPerClass_Service | 0.036232 | 0 | 0.000491 | 0.001426 |
| Services | 0.036232 | 0 | 0.000491 | 0 |
| Checkins_Hour8to9 | 0.023411 | 0 | 0 | 0 |
| Integration_Listen360 | 0.012422 | 0.030769 | 0.000982 | 0.005943 |
| Forms | 0.012422 | 0.030769 | 0.000982 | 0.004854 |
| Checkins_Monday | 0.005929 | 0.015385 | 0.000491 | 0.005731 |
| Checkins_Weekday | 0.005929 | 0.015385 | 0.000491 | 0.002713 |
| Checkins_Hour14to15 | 0.005929 | 0.015385 | 0.000491 | 0.003422 |
| Checkins_Morning | 0.005929 | 0.015385 | 0.000491 | 0.003546 |
| Checkins_Afternoon | 0.005929 | 0.015385 | 0.000491 | 0.002234 |
| Checkins_Monday_Past3Months | 0.005929 | 0.015385 | 0.000491 | 0.000259 |
| Checkins_Hour12to13_Past3Months | 0.005929 | 0.015385 | 0.000491 | 0 |
| Checkins_Evening_Past3Months | 0.005929 | 0.015385 | 0.000491 | 0.00133 |
| ActiveEmployees_Last3Months | 0.005929 | 0.015385 | 0.000491 | 0 |
| UserEmployeeRatio_Last3Months | 0.005929 | 0.015385 | 0.000491 | 0 |
| ActiveEmployees | 0 | 0.015385 | 0 | 0.000637 |
| PurchasedSessions | 0 | 0 | 0 | 0.009245 |
| Amenities | 0 | 0 | 0 | 0.007819 |
| StoreAgeMonths | 0 | 0 | 0 | 0.003792 |
| UserEmployeeRatio | 0 | 0 | 0 | 0.002884 |
| Checkins_Thursday_Past3Months | 0 | 0 | 0 | 0.002714 |
| Checkins | 0 | 0 | 0 | 0.002576 |
| Checkins_Hour4to5 | 0 | 0 | 0 | 0.00185 |
| Checkins_Past3Months | 0 | 0 | 0 | 0.001458 |
| Checkins_Hour9to10 | 0 | 0 | 0 | 0.001359 |
| AveragePricePerService | 0 | 0 | 0 | 0.000855 |
| AveragePricePerClass_Service | 0 | 0 | 0 | 0.000808 |
| Checkins_Hour6to7 | 0 | 0 | 0 | 0.000788 |
| Checkins_Hour1to2_Past3Months | 0 | 0 | 0 | 0.000597 |
| Checkins_Hour13to14 | 0 | 0 | 0 | 0.000224 |
| NonRequiredWaiver | 0 | 0 | 0 | 0.000083 |
| Checkins_Hour17to18 | 0 | 0 | 0 | 0.000071 |
| Checkins_Saturday | 0 | 0 | 0 | 0.000013 |
If all of these variables are used, performance improves, moving closer to the original model.
Full Variable Set (130 Variables)
| Test/Training Set | Precision | Recall | Precision * Recall |
|---|---|---|---|
| Set 1 | 0.893 | 0.384 | 0.343 |
Precision/Recall Variable Set (23 Variables)
| Test/Training Set | Precision | Recall | Precision * Recall |
|---|---|---|---|
| Set 1 | 0.815 | 0.336 | 0.27384 |
| Set 2 | 0.767 | 0.287 | 0.220129 |
| Set 3 | 0.84 | 0.309 | 0.25956 |
Precision/Recall/Accuracy/Average Log Loss Variable Set (40 Variables)
| Test/Training Set | Precision | Recall | Precision * Recall |
|---|---|---|---|
| Set 1 | 0.88 | 0.338 | 0.29744 |
| Set 2 | 0.889 | 0.3 | 0.2667 |
| Set 3 | 0.96 | 0.353 | 0.33888 |
This looks promising. Given the Users data contains all of these Store variables, the decision on which variables to keep will wait until the Permutation Feature Importance results for the User Churn scenario are available.
Survival Analysis
Introduction
In the Initial Data Modeling phase for User Churn, the data set does not contain sufficient information for Classification. However, there are other approaches to this type of problem. One such alternative is Survival Analysis.
In Survival Analysis, the goal is to predict Time to Failure. To do this, a Start Time and an End Time must be defined. The End Time is obvious: it is the first day of the Churn Month for that User. For simplicity, the first day of the Activation Month for that User is used as the Start Time.
In this Survival Analysis, the Cox Proportional Hazards model is used. This model accepts a number of numeric variables for each combination of month and user, and predicts how many months until the User is expected to churn.
Data Preparation
To prepare the data for the Cox Proportional Hazards model, the dataset must be changed slightly. Instead of using the IsChurn variables, create a new variable called MonthsUntilChurn. The SQL for creating this dataset can be found in the [ML].[SA_CompleteUserData] stored procedure in the [Reports] database.
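A toy sketch of the reshaping (the real logic lives in [ML].[SA_CompleteUserData]): for each monthly snapshot, MonthsUntilChurn counts down to the churn month. The `months_between` helper is a hypothetical name.

```r
# Whole months between two month-start dates.
months_between <- function(from, to) {
  12L * (as.integer(format(to, "%Y")) - as.integer(format(from, "%Y"))) +
    (as.integer(format(to, "%m")) - as.integer(format(from, "%m")))
}

# A user active January-March 2017 who churns in April: each snapshot row
# gets the remaining months until churn (3, 2, 1, 0).
snap_months <- seq(as.Date("2017-01-01"), as.Date("2017-04-01"), by = "month")
MonthsUntilChurn <- months_between(snap_months, as.Date("2017-04-01"))
```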
Below is the survival curve for the sampled dataset.
sauserdata <- read_csv("C:/Users/cweaver/Downloads/Users_SA_1M_2017_12_29.csv", col_types = cols(
Store_UserEmployeeRatio = col_double(),
Store_UserEmployeeRatio_Last3Months = col_double(),
Store_AverageMinsPerClass_Service = col_double(),
Store_AveragePricePerClass_Service = col_double(),
Store_AverageCancellationHrsPerClass_Service = col_double(),
Store_AverageRescheduleDeadlinePerClass_Service = col_double(),
Store_AverageMinsPerClass = col_double(),
Store_AveragePricePerClass = col_double(),
Store_AverageCancellationHrsPerClass = col_double(),
Store_AverageRescheduleDeadlinePerClass = col_double(),
Store_AverageMinsPerService = col_double(),
Store_AveragePricePerService = col_double(),
Store_AverageCancellationHrsPerService = col_double(),
Store_AverageRescheduleDeadlinePerService = col_double())
,progress = FALSE
)
sa <- Surv(sauserdata$MonthsUntilChurn)
sa.fit <- survfit(sa ~ 1)
plot(sa.fit, main = "User Churn Survival Curve", xlab = "Months Active", ylab = "Proportion of Users Still Active")
Following the same process explored previously, remove non-predictive variables -
ignore.set <- c("UserId", "StoreId", "ReferenceMonthStartDate","IsChurn", "IsChurn_2Months", "IsChurn_3Months", "IsChurn_6Months", "MonthsUntilChurn_RC")
disp <- data.frame(ignore.set)
names(disp) <- "Variables Ignored for Business Reasons"
disp
## Variables Ignored for Business Reasons
## 1 UserId
## 2 StoreId
## 3 ReferenceMonthStartDate
## 4 IsChurn
## 5 IsChurn_2Months
## 6 IsChurn_3Months
## 7 IsChurn_6Months
## 8 MonthsUntilChurn_RC
Remove any variables with missing values -
sauserdata.miss <- sauserdata[,-which(names(sauserdata) %in% ignore.set)]
myNumSum <- numSummary(sauserdata.miss)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missCNT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(n))
ignore.miss <- myNumSum[myNumSum[,5] > 0,1]
disp <- data.frame(ignore.miss)
names(disp) <- c("Variables with Missing Values")
disp
## Variables with Missing Values
## 1 Store_UserEmployeeRatio_Last3Months
## 2 Store_UserEmployeeRatio
## 3 Store_AverageMinsPerClass_Service
## 4 Store_AveragePricePerClass_Service
## 5 Store_AverageCancellationHrsPerClass_Service
## 6 Store_AverageRescheduleDeadlinePerClass_Service
## 7 Store_AveragePricePerService
## 8 Store_AverageCancellationHrsPerService
## 9 Store_AverageRescheduleDeadlinePerService
## 10 Store_AverageMinsPerService
## 11 Store_AverageMinsPerClass
## 12 Store_AveragePricePerClass
## 13 Store_AverageCancellationHrsPerClass
## 14 Store_AverageRescheduleDeadlinePerClass
Remove variables with no variance -
sauserdata.var <- sauserdata.miss[,-which(names(sauserdata.miss) %in% ignore.miss)]
ignore.var <- names(which(round(apply(sauserdata.var, MARGIN = 2, FUN = var), 4)<= 0))
disp <- data.frame(ignore.var)
names(disp) <- "Variables with No Variance"
disp
## Variables with No Variance
## 1 NonRequiredForms
## 2 NonRequiredWaiver
## 3 NonRequiredHealthHistory
## 4 RequiredHealthHistory
## 5 Store_SMSEnabled
## 6 Store_Integration_FitBPO
## 7 Store_Integration_MotionSoft
## 8 Store_Integration_Paramount
## 9 Store_Integration_ShareYourFit
## 10 Store_Integration_VisualFitnessPlanner
## 11 Store_IntegrationType_LandingPages_Marketing
## 12 Store_IntegrationType_SocialMedia
Perform One-Hot Processing to change any remaining categorical variables into numeric ones. (Again, the results are excessive and not displayed.)
sauserdata.enc <- sauserdata.var[,-which(names(sauserdata.var) %in% ignore.var)]
ignore.char <- row.names(charSummary(sauserdata.enc))
encoder <- onehot(sauserdata.enc[,which(names(sauserdata.enc) %in% ignore.char)], stringsAsFactors = TRUE)
dat.encoded <- predict(encoder, sauserdata.enc)
sauserdata.encoded <- data.frame(sauserdata.enc[,-which(names(sauserdata.enc) %in% ignore.char)], dat.encoded)
glimpse(sauserdata.encoded)
Create the testing and training sets -
ind.train <- sample(seq_len(dim(sauserdata.encoded)[1]), floor(dim(sauserdata.encoded)[1] * .7))
sauserdata.train <- sauserdata.encoded[ind.train,]
sauserdata.test <- sauserdata.encoded[-ind.train,]
disp <- as.data.frame(matrix(c(dim(sauserdata.train),dim(sauserdata.test)),ncol=2,byrow=FALSE))
names(disp) <- c("Training set", "Testing Set")
row.names(disp) <- c("Number of Records", "Number of Variables")
disp
## Training set Testing Set
## Number of Records 626721 268596
## Number of Variables 200 200
Cox Proportional Hazards Model
Train the survival analysis predictive model.
sa <- Surv(sauserdata.train$MonthsUntilChurn)
cox.form <- formula(paste("sa ~ ", paste(names(sauserdata.train), collapse = " + "), " - MonthsUntilChurn"))
cox.fit <- coxph(cox.form, data = sauserdata.train)
Survival Analysis is different from Classification. In Survival Analysis, binomial predictions are not output. Instead, survival curves are provided. Here is a survival curve for one user.
cox.test <- survfit(cox.fit, newdata = sauserdata.test)$surv
plot(cox.test[,1], main = "Survival Curve for User", xlab = "Months Active", ylab = "Probability of Survival", type = "l")
abline(a=.5, b=0, col = "red")
The curve is used to determine when the User is likely to churn. In other words, when is the probability of the user churning greater than the probability of the user not churning - that is, when does the probability of survival drop below 0.5?
pred <- function(x = c(), thresh = .5){
y <- order(x<=thresh, x, decreasing = c(TRUE, TRUE))[1]
return(y)
}
sapred <- apply(cox.test, MARGIN = 2, FUN = pred)
saeval <- data.frame(sauserdata.test$MonthsUntilChurn, sapred)
names(saeval) <- c("Actual", "Predicted")
head(saeval)
## Actual Predicted
## 2 1 1
## 7 4 1
## 8 1 1
## 11 18 1
## 22 13 1
## 26 15 1
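One simple way to quantify the gap visible above is the mean absolute error between the actual and predicted churn month. Using only the six rows shown (illustrative; the full test set would be used in practice):

```r
# The six rows from head(saeval) above.
saeval_head <- data.frame(Actual = c(1, 4, 1, 18, 13, 15),
                          Predicted = c(1, 1, 1, 1, 1, 1))

# Mean absolute error in months between predicted and actual churn.
mae <- mean(abs(saeval_head$Actual - saeval_head$Predicted))
# For these rows the MAE is roughly 7.7 months - far too coarse to act on.
```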
The survival analysis for User Churn did not provide adequate performance to proceed. Valorem submits this is further evidence the data for User Churn does not support an algorithmic solution. More user-centric data is needed.
NEW Data Analysis
The information below was not originally included in the project SOW. It was created as a natural extension of the Digital Insights Workshop to provide a glimpse of the additional data insights a modern data platform can provide.
Introduction
In the User Data Modeling phase, it was discovered the User Churn data does not contain patterns or signals strong enough to create a useful algorithmic model. This leads to the question: what data is needed to develop a user churn predictive model?
Conclusion
Valorem created a Power BI demonstration to help ClubReady answer this question. The demonstration solution is available for review in the Users Prototype Analysis.pbix file. This was created to introduce ClubReady to the opportunities for data exploration using the simplicity - and power - of Power BI.
The analysis that follows can be reproduced using the PBIX file.
Data
The variables in the [ClubReady].[dbo].[Users], [UserAmenityHistory], [Tags] and [ContractPurchases] tables were evaluated to determine which ones were sufficiently populated and contained information with potentially useful analytic value. The variables include:
- Users Activations
- Users CheckinCredentialsModified, HasDoctor & HasAltPhone
- Users AgeAtActivation & AgeAtChurn
- Users MonthsActive
- Users Address
- Users Referral Type & Prospect Type
- [UserAmenityHistory] AmenityName
- [Tags] Tags
- ContractPurchases HasRequiredSignatures, ElectronicSignature & Membership
- ContractPurchases PurchasedSessions
- ContractPurchases ElectronicSignaturesTaken
The SQL code can be found in the [PBI].[Users] view within the [Reports] database.
Activations
Activations running higher than churn is desirable because it means ClubReady is gaining users. This is the case for the majority of the time frames shown. [ChurnMonthStartDate] is a potential predictor for User Churn.
Here are a few more trends found by using Power BI:
Binary Users Variables
[CheckinCredentialsModified], [HasDoctor] and [HasAltPhone] variables are potentially useful. However, further inspection shows that [HasDoctor] and [HasAltPhone] variables are too rarely populated to be of use. This leaves [CheckinCredentialsModified].
User Age
Evaluate the User Age variables, AgeAtActivation and AgeAtChurn.
Churn and Activations seem to follow a similar trend with respect to age. Explore further by examining Churn % by Age.
There is a clear trend between Churn and AgeAtActivation.
Months Active
Does this same relationship hold true for Length of Membership?
The top histogram shows a very common problem with certain numeric data, such as date-based and monetary data: it is generally very right-skewed. A way is needed to create bins that are not equally sized, and an easy way to do this is with a logarithm transform. The bottom histogram shows how this creates a much more usable visualization. There is clearly a relationship between Total Churn and Length of Membership. Now determine whether this holds true for Churn %.
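The logarithm trick can be reproduced directly in R: equal-width bins on `log1p(x)` become exponentially widening bins on the original scale. A toy sketch:

```r
# Right-skewed toy values (e.g. months active or dollar amounts).
x <- c(0, 1, 2, 5, 12, 60, 400)

# Four equal-width bins on the log1p scale; log1p handles zeros safely.
# On the original scale the bin widths grow exponentially, so the small
# values are no longer crushed into a single bar.
log_bins <- cut(log1p(x), breaks = 4)
```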
Another interesting result! User Churn within the first year is extremely high. However, this drops dramatically as they remain active for subsequent years. [MonthsActive] is also a potentially powerful predictor.
Geography
Evaluate User Address on Churn.
Looking at the charts on the right, it is easy to see that State, Zip and City show high variability in User Churn, which could make these fields useful for modeling. However, it is important to note the size of these variables in the charts on the left. Given the number of distinct values, using them directly is unlikely to add much power; this may instead be a use case for some type of count-based or ratio-based variable. Another consideration is that the City names seem very unreliable. Without the opportunity to subject these fields to data cleansing, it would be premature to call them high-quality predictors.
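A sketch of the suggested ratio-based encoding for a high-cardinality field such as State: replace each value with its historical churn rate (computed on training data only, to avoid leakage). The toy data is hypothetical.

```r
# Toy training data: one row per user with their State and churn flag.
geo <- data.frame(State = c("MO", "MO", "KS", "KS", "KS", "TX"),
                  Churned = c(1, 0, 1, 1, 0, 0))

# Mean churn rate per State, then mapped back onto each row as a
# single numeric predictor in place of the high-cardinality category.
state_rate <- tapply(geo$Churned, geo$State, mean)
geo$State_ChurnRate <- as.numeric(state_rate[as.character(geo$State)])
```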
Referral Type and Prospect Type
The final two variables from the Users table are [Referral Type] and [Prospect Type].
Similar to Geography, [Referral Type] and [Prospect Type] show high variability for User Churn. However, their high granularity suggests that Count-based or Ratio-based variables would be beneficial.
Amenity
The next dataset we want to examine is UserAmenityHistory. This dataset contains all of the Amenities connected to each user. This dataset is quite small and only contains one field of interest, [AmenityName]. Examine the impact Amenity selection has on User Churn.
Amenity selection creates variability for User Churn. Use Power BI’s built-in interactions to see how this plays out within particular stores.
It is reasonable to conclude Amenity is a potential predictor for User Churn, especially in conjunction with Store. However, given its high granularity, some type of Count-based or Ratio-based approach may be needed.
Tag
Similar to Amenities, this dataset contains all of the Tags associated to each user.
Tags suggest a very similar story to Amenities. However, given the free-form nature of the field and how rarely it is used, it seems unlikely much insight can be derived. This is supported by the fact that the “Tags” field in the Store Churn dataset was thrown out by the Feature Selection process.
ContractPurchases
Evaluate the ContractPurchases dataset, which contains the membership-related purchases for each customer.
Binary ContractPurchases Variables
[HasRequiredSignatures], [ElectronicSignature] and [Membership] all show variability for User Churn. The bottom charts show these variables are well-populated. The same is not true for [CommissionsOk] and [SMSCheck].
- HasRequiredSignature = TRUE : ElectronicSignature
- HasRequiredSignature = TRUE : Membership
- ElectronicSignature = TRUE : Membership
It appears [HasRequiredSignatures], [ElectronicSignature] and [Membership] fields all contain similar information. This means they are potentially good predictors.
Purchased Sessions
The major component of the ContractPurchases dataset is [PurchasedSessions]. This indicates how active a user is.
Because the [PurchasedSessions] distribution is right-skewed, a log transformation is applied.
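A minimal sketch of the transformation, using `log1p` so the many zero counts remain defined; the data here are simulated right-skewed counts standing in for the real column:

```r
library(ggplot2)

# Simulated right-skewed counts; in practice this is
# ContractPurchases$PurchasedSessions
purchased_sessions <- rpois(1000, lambda = 2)^2

# log1p (i.e., log(1 + x)) compresses the long right tail
# without dropping the zeros
ggplot(data.frame(PurchasedSessions = purchased_sessions),
       aes(x = log1p(PurchasedSessions))) +
  geom_histogram(bins = 30) +
  labs(x = "log(1 + PurchasedSessions)", y = "Users")
```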
One way to create useful variables out of counts is to create "buckets", a technique also known as "discretization". This chart tells us we could create the following buckets for [PurchasedSessions], possibly leading to a useful predictor.
- 0 PurchasedSessions
- 1 PurchasedSession
- 2 PurchasedSessions
- 3 PurchasedSessions
- 4 - 40 PurchasedSessions
- More than 40 PurchasedSessions
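The bucketing above maps directly onto R's `cut` function; a sketch with illustrative counts:

```r
# Illustrative counts; in practice, ContractPurchases$PurchasedSessions
sessions <- c(0, 0, 1, 2, 3, 5, 12, 40, 41, 120)

# Discretize into the buckets identified above
buckets <- cut(sessions,
               breaks = c(-Inf, 0, 1, 2, 3, 40, Inf),
               labels = c("0", "1", "2", "3", "4-40", ">40"))
table(buckets)
```

The `breaks` are right-closed, so 0 falls in "0", 4 through 40 in "4-40", and anything above 40 in ">40".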
Review Purchased Sessions over Time to see if there is anything useful.
Curiously, [PurchasedSessions] follows the same chronological trend as Activations, as opposed to Churn. This is not helpful for modeling, but it does suggest further exploration may be warranted. Additionally, the scatterplot shows [PurchasedSessions] was positively correlated with Churn in 2016 but negatively correlated in 2017. Again, this is not helpful for modeling, but it is informative and thought-provoking.
It is likely that including [PurchasedSessions] and a discretized version in the User Churn dataset will help create a better model.
Electronic Signatures Taken
The final variable in the ContractPurchases dataset is [ElectronicSignaturesTaken]. This variable records the number of electronic signatures taken when the user purchased the contract.
There appears to be a weak correlation between [ElectronicSignaturesTaken] and [Churn %]. However, the number of users in each segment greater than 0 is small. Therefore, it may also be beneficial to include a binary version of this variable, [ElectronicSignaturesTaken>0]. In theory, this should be identical to the [ElectronicSignature] variable; however, a small data quality issue causes them to differ. It might be worthwhile to investigate this at a later date.
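Deriving the binary version and comparing it against the existing flag takes one cross-tabulation; the off-diagonal cells quantify the suspected data quality discrepancy. A sketch with illustrative data:

```r
# Illustrative data; in practice these come from the ContractPurchases table
contracts <- data.frame(
  ElectronicSignaturesTaken = c(0, 0, 1, 2, 0),
  ElectronicSignature       = c(FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Binary version of the count; in theory it should match ElectronicSignature
contracts$SignaturesTakenGT0 <- contracts$ElectronicSignaturesTaken > 0

# Off-diagonal cells are the rows where the two disagree
table(contracts$SignaturesTakenGT0, contracts$ElectronicSignature)
```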
Results
Below are all the fields identified in this section as potentially useful for predicting User Churn. These can be found in the User Churn Data Analysis Results.xlsx file.
| Database | Schema | Table | Field | Final Table | Final Field | Notes |
|---|---|---|---|---|---|---|
| Reports | ML | UserChurn | All | Users | | |
| ClubReady | dbo | Users | CheckinCredentialsModified | Users | CheckinCredentialsModified | |
| ClubReady | dbo | Users | EmergencyContactName | Users | HasEmergencyContact | |
| ClubReady | dbo | Users | EmergencyContactPhone | Users | HasEmergencyContact | |
| ClubReady | dbo | Users | Phone | Users | HasPhone | |
| ClubReady | dbo | Users | CellPhone | Users | HasCellPhone | |
| ClubReady | dbo | ContractPurchases | ActivationDateUtc | Users | AgeAtActivation | |
| ClubReady | dbo | ContractPurchases | AgreedDate | Users | AgeAtActivation | |
| ClubReady | dbo | Users | dob | Users | AgeAtActivation | |
| ClubReady | dbo | Users | (Reused Field) | Users | MonthsActive | |
| ClubReady | dbo | Users | State | Users | State | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Users | Zip | Users | Zip | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Users | City | Users | City | Needs Count-Based or Ratio-Based Approach; Needs Data Cleansing |
| ClubReady | dbo | Users | ReferralTypeId | Users | Referral Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | ReferralType | ReferralType | Users | Referral Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Users | ProspectTypeId | Users | Prospect Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | ProspectType | ProspectType | Users | Prospect Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Amenities | AmenityName | UserAmenityHistory | Amenity | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | ContractPurchases | HasRequiredSignatures | ContractPurchases | HasRequiredSignatures | |
| ClubReady | dbo | ContractPurchases | ElectronicSignature | ContractPurchases | ElectronicSignature | |
| ClubReady | dbo | ContractPurchases | Membership | ContractPurchases | Membership | |
| ClubReady | dbo | ContractPurchases | TotalSessions | ContractPurchases | PurchasedSessions | Discretize to [0, 1, 2, 3, 4-40, >40] |
| ClubReady | dbo | ContractPurchases | ElectronicSignaturesTaken | ContractPurchases | ElectronicSignaturesTaken | Discretize to [0, >0] |
Further Analysis
The tables that could contain potentially useful variables for User Churn are presented below:
- [ClubReady].[dbo].[PurchaseLog]
- [ClubReady].[dbo].[Stores]
- [ClubReady].[dbo].[Checkinlog_v]
- [ClubReady].[dbo].[UserWaiverRequirement]
- [ClubReady].[dbo].[DeclineLog]
- [ClubReady].[dbo].[bookings]
- [ClubReady].[dbo].[loginlog]
Conclusions & Next Steps
Throughout the Digital Insights Workshop, Valorem and ClubReady collaborated to create a data-driven journey designed to reshape ClubReady into a data-first organization.
Store Churn: A prototype "Store Churn" algorithm that predicts when a Store is likely to go out of business in the next three months was created. This is accessible via an Azure Machine Learning Web Service. It is recommended ClubReady analyze the results of this algorithm for the next few months to determine its real-world accuracy. If successful, ClubReady may incorporate the algorithm into its application or use it as a standalone utility. If the algorithm fails to meet expected performance, a targeted analysis to determine its deficiencies and corrective action could be undertaken.
User Churn: Less success was realized in developing a User Churn algorithm. The data analysis and the modeling strongly suggest the data does not contain sufficiently information-rich user behavioral data to build an algorithmic predictive solution. This was discovered during the user churn modeling and confirmed through PCA analysis. Perhaps in the future the ClubReady applications will capture more user behavioral data so a user churn model can be developed.
Future Data Exploration: The data exploration, the associated R code, and the resulting Power BI Desktop file can be used as a guide to design a full-scale analytic platform. This provides ClubReady the opportunity to interactively explore what stories the data might tell, thereby unlocking the ability not only to answer questions, but also to understand which questions have yet to be asked. This contributes to faster, more accurate and more efficient reporting opportunities. This is a critical step to becoming a data-first organization.
Recommendation - Transaction Platform Redesign: ClubReady uses a single Azure IaaS SQL Database to power its application. This is the product of years of agile development and rapidly implemented business features, often customer-specific changes. With the availability of newer technologies, such as Data Lake, this is an opportune time to consider a redesign of the existing system. This will open up new opportunities for more agile and/or continuous integration and development. It will also strengthen ClubReady's ability to perform routine and ad hoc analyses as part of its Data Analytics journey. This builds the environment where machine learning can become the fabric stitching the data and evolving business requirements into an industry-leading data-first platform.